Abstract
5-Methylcytosine (5mC) is a widespread silencing mechanism that controls genomic parasites. In eukaryotes, 5mC has gained complex roles in gene regulation beyond parasite control, yet 5mC has also been lost in many lineages. The causes for 5mC retention and its genomic consequences are still poorly understood. Here, we show that the protist closely related to animals Amoebidium appalachense features both transposon and gene body methylation, a pattern reminiscent of invertebrates and plants. Unexpectedly, hypermethylated genomic regions in Amoebidium derive from viral insertions, including hundreds of endogenized giant viruses, contributing 14% of the proteome. Using a combination of inhibitors and genomic assays, we demonstrate that 5mC silences these giant virus insertions. Moreover, alternative Amoebidium isolates show polymorphic giant virus insertions, highlighting a dynamic process of infection, endogenization, and purging. Our results indicate that 5mC is critical for the controlled coexistence of newly acquired viral DNA into eukaryotic genomes, making Amoebidium a unique model to understand the hybrid origins of eukaryotic DNA.
INTRODUCTION
5-Methylcytosine (5mC) is a common base modification among eukaryotes (1–3). 5mC is deposited by DNA methyltransferases (DNMTs), a family of enzymes with ancestral families conserved throughout eukaryotes (4, 5). Some DNMTs are maintenance type enzymes, perpetuating 5mC patterns, including DNMT1 and DNMT5, while other DNMTs have de novo activity, such as DNMT3 (6, 7). However, the DNMT repertoire of an organism is not predictive of 5mC function. In some eukaryotes, including plants and animals, 5mC is associated with gene regulation, exemplified by gene body methylation, where 5mC positively correlates with gene transcriptional levels (1, 3, 8). However, the most widespread role of 5mC is in transposable element (TE) silencing, which is the assumed ancestral role in eukaryotes (9, 10).
Despite most attention being devoted to controlling endogenous parasitic elements, one of the first described functions of 5mC in eukaryotes was to silence retroviral insertions in mammals (11). Similarly, in bacteria, the main role of 5mC is to combat viruses (12). Therefore, controlling exogenous viral invasions is arguably as important as TE control for epigenetic silencing. It is increasingly recognized that many eukaryotic genes have viral origins, co-opted repeatedly throughout evolution (13). One of the most common sources for these acquisitions are giant viruses (Nucleocytoviricota). Giant viruses have a wide range of eukaryotic hosts and are present in almost all ecosystems, posing a widespread threat to eukaryotic cells (14, 15). Giant viruses are exceptional among viruses as they have enormous genomes (100 kb to 2.5 Mb) encoding many proteins thought to be eukaryotic hallmarks such as histones (14, 15). Giant viruses originated before modern eukaryotes, and they have been proposed to have contributed essential genes to eukaryogenesis (16–18). Furthermore, recent reports indicate that giant viruses can endogenize into extant eukaryotes (19–21). However, how this potentially lethal DNA is incorporated into eukaryotic genomes is currently not understood.
Finding a link between viral control and epigenetic regulation, however, is hampered by the scarcity of reported recent giant virus endogenizations (19, 22). Moreover, 5mC is evolutionarily very plastic, and many eukaryotic lineages have lost this epigenetic modification (1, 2), possibly because of its mutagenic potential and cytotoxic off-target effects of DNMTs (23). Furthermore, 5mC function varies across lineages. In fungi, 5mC is restricted to silencing TEs (24), whereas in invertebrates 5mC is usually restricted to gene bodies, and most TEs remain unmethylated (1, 2, 25). To expand our knowledge of 5mC systems and to unravel how a potentially ancestral fungal-like methylation pattern gave rise to the animal 5mC system, we focused on protists of the holozoan clade. These close animal relatives form four major lineages: choanoflagellates, filastereans, ichthyosporeans, and pluriformeans (Fig. 1A) (26, 27). In recent years, unicellular holozoan genomes have been shown to encode many genes previously thought to be unique to animals, informing the complex genomic nature of the unicellular ancestors of animals (26–29). However, none of these genomes encode DNMTs, suggesting an evolutionary loss of 5mC capacity (30). Here, we fill this gap by describing a unicellular relative of animals that has maintained 5mC, and unexpectedly find an unappreciated and potentially ancestral use of 5mC in regulating giant virus endogenizations.
RESULTS
The Amoebidium genome presents both gene body and TE methylation
To reconstruct the pre-animal roots of 5mC, we searched the available genomes and transcriptomes of unicellular holozoans for DNMT1 orthologs (29, 31), the maintenance DNMT in animals. We discovered that DNMT1 is expressed by Amoebidium appalachense (Fig. 1A), an ichthysoporean originally isolated from the cuticle of freshwater arthropods that can be grown axenically (32). Like other Ichthyosporea (33), Amoebidium are parasites or commensals of animal hosts, and show a coenocytic life cycle, with cells dividing their nuclei in a shared cytoplasm until forming a mature colony, which gives rise to unicellular and uninucleated cells (34). Yet, Amoebidium branches quite deep within the ichthyosporean lineage (Fig. 1A). Amoebidium DNMT1 sequence presents the same domain architecture of animal DNMT1 orthologs, including a zinc finger CXXC absent in non-holozoan sequences (fig. S1A), which suggests that this domain architecture was an innovation of holozoans despite DNMT1 being lost repeatedly. Additionally, we found DNMT1 orthologs in the metagenome-derived assemblies of the filastereans of the genus Pigoraptor (35), with the same domain architectures as Amoebidium (fig. S1A). Additionally, both Amoebidium and Pigoraptor encode UHRF1 orthologs with highly similar domain architectures to animal counterparts (fig. S1A), suggesting that DNMT1 and UHRF1 heterodimerization is conserved. Unlike Amoebidium, Pigoraptor species can only be grown at low densities in association with diverse bacteria and eukaryotic prey, the kinetoplastid Parabodo caudatus, making functional approaches and reliable methylation profiling challenging (36).
To fully characterize the gene repertoire and investigate Amoebidium 5mC patterns, we sequenced the genome of this species using a combination of nanopore long reads, Illumina short reads, and micro-C. We obtained a chromosome-scale assembly spanning 202 Mb, with 96.6% of the sequence contained in 18 chromosomes and a 95% BUSCO score (Fig. 1B and fig. S2, A and B). When examining Amoebidium’s DNMT repertoire, we identified a total of 18 DNMTs, comprising orthologs of DNMT1, DNMT3, DNMT2, Dim-2, and various lineage-specific clades (Fig. 1A and fig. S1, B and C). Notably, unlike DNMT1, none of the other DNMTs exhibit additional domains. Specifically, DNMT3 lacks the protein domains found in animal DNMT3, such as PWWP or ADD (fig. S1A) (7, 30). In contrast, none of the ichthyosporeans with genomic data available show DNMTs other than the RNA-specific DNMT2 and DNMT6 (Fig. 1A) (3, 7, 37). Among other holozoans, we could only find a putative DNMT3 in the transcriptome of the choanoflagellate Achanthoeca spectabilis (31) and both Pigoraptor species (35), with the same domain architecture lacking additional protein domains as in Amoebidium (fig. S1A). Thus, Amoebidium and Pigoraptor are the only sequenced unicellular holozoan species that retain the ancestral eukaryotic complement of DNMTs, highlighting the pervasive tendency of eukaryotes to lose 5mC.
Next, we performed whole-genome DNA methylation profiling to analyze the 5mC patterns in Amoebidium, as well as in three other ichthyosporean species lacking DNA DNMTs as negative controls. In Amoebidium, global methylation levels soar to 40%, exclusively within the CG dinucleotide context, setting it apart from most invertebrates and fungi and the other ichthyosporean species, which exhibit negligible levels of 5mC (Fig. 1B) (1, 24). Notably, not all CG dinucleotides exhibit uniform methylation levels. Specifically, the symmetrical mCGC and GmCG trinucleotides stand out with hypermethylation levels at around 70%, whereas the remaining CG dinucleotides maintain lower levels at approximately 20% (fig. S3A). This suggests that Amoebidium boasts elevated methylation levels with a wider sequence specificity beyond the CG dinucleotide, a context-dependent regionalization of 5mC reminiscent of heterochromatin methylation in mammals (38), likely reflecting the sequence preferences of the diverse Amoebidium DNMTs.
Considering the high global methylation levels in Amoebidium, we proceeded to investigate which genomic regions exhibit enriched 5mC. Protein-coding genes displayed a gene body methylation pattern reminiscent of plants and animals, with relatively low levels of promoter methylation (Fig. 1C) (39, 40). However, Amoebidium’s gene body methylation is not positively correlated with transcription as in plants or animals, as all active genes have similar methylation levels irrespectively of transcriptional level, whereas silent genes show higher methylation, including the promoter (Fig. 1C and fig. S3B). Therefore, gene body methylation appears not exclusive to animals in the holozoan clade, yet its positive association with transcription is an animal-specific feature potentially linked to the domain acquisitions of animal DNMT3s (fig. S1A).
In contrast to most invertebrates (1, 2), Amoebidium exhibits targeted methylation of TEs (Fig. 1D and fig. S3, B and C). Notably, methylation levels are highest in recent TE insertions and on transcriptionally silent genes (Fig. 1D and fig. S3C), regardless of the adjacent CG sequence context. In contrast, gene body methylation of actively transcribed genes primarily occurs within the CGC/GCG trinucleotide context (Fig. 1C). This indicates that in Amoebidium, 5mC of CGs in non-CGC/GCG trinucleotide context correlates with silencing, whereas CGC/GCG methylation is widespread. Further supporting the link between 5mC and TE silencing, approximately 50% of Amoebidium’s genome is composed of TEs, a level unmatched in any unicellular holozoans, yet similar to vertebrates such as humans (50%) or zebrafish (Fig. 1B and fig. S2, C and D). Therefore, the genome of Amoebidium is possibly permissive to TE expansions because 5mC can silence these novel insertions by reducing their potential deleterious effects, similar to what has been proposed for vertebrates (41),
Large hypermethylated regions uncover hundreds of viral insertions
To characterize the chromosome-level distribution of 5mC, we took advantage of the relative depletion of non-CGC/GCG methylation to locate regions of hypermethylation across the genome. We found many islands of hypermethylation spread across the chromosomes (Fig. 2A), many of which were consistent with regions of high TE content. However, many presented highly gene-rich areas spanning up to 200 kb, with most genes showing few to no introns, in clear contrast to the intron-rich Amoebidium genes (average 7.2 exon/gene; fig. S4A). Further characterization of these areas revealed core giant virus genes, including poxvirus late transcription factor (VLTF3), A32-like packaging adenosine triphosphatase (ATPase), D5 DNA primase, or nucleocytoplasmic large DNA viruses (NCLDV) major capsid proteins (fig. S4A) (14, 42). Using these core genes, we searched the National Center for Biotechnology Information (NCBI) database and performed phylogenetic analyses using curated databases of giant virus marker genes (42). We found that these insertions could be classified as belonging to a lineage belonging to the order pandoravirales, closely related to the Mamonoviridae family of Medusavirus and Clandestinovirus (infecting amoebozoans; fig. S5, A and B) (43). Yet, not all the giant endogenous viral elements (GEVEs) in the Amoebidium genome originate from a single catastrophic insertion event or even a single viral lineage, as they show high levels of sequence divergence among them (fig. S5C). Furthermore, there are insertions in almost all chromosomes (90 chromosomal insertions, with 42 sequences in unplaced contigs), some having accumulated secondary TE insertions (Fig. 2A). The disparity of insertion lengths, and the observation that none of them encode a full repertoire of core giant virus genes (Fig. 2B), suggests that complete viral genome integrations are rare or that gene loss occurs rapidly after insertion. Using ViralRecall to detect giant viruses solely based on sequence (44), we confirmed the presence of GEVEs in Amoebidium, yet we failed to recover any hits from other holozoans other than in Pigoraptor, which encodes for few giant virus markers, including a capsid protein or a VLTF3, yet these branch far from Amoebidium hits (fig. S5, A and B), suggesting that distant classes of giant viruses might endogenize into Pigoraptor genomes. However, the fragmented status and metagenomic source of Pigoraptor genome assemblies render confident GEVE identification problematic, as some might belong to viruses or other species found in the complex cultures.
In addition to the giant viruses, other compact hypermethylated regions in the Amoebidium genome were characterized by genes encoding VLTF3, Dam methyltransferase, minor and major capsid proteins, and a DNA polymerase family B (fig. S4B). DNA polymerase sequences produced closest matches to adintovirus (family Adintoviridae), a group of recently described double-stranded DNA polinton-related viruses thought to exclusively infect animals (fig. S5D) (45). It is worth noting that many polinton-related viruses are virophages or descendants of these (46, 47), known to parasitize giant viruses, which could explain the abundance of these sequences in the Amoebidium genome. Similarly to the giant viruses, not all adintoviruses were closely related among each other (fig. S5E), suggesting multiple independent insertion events. In contrast to GEVEs, some insertions kept long terminal repeats and were complete (~30 kb; Fig. 2B), yet others were truncated and in the process of degeneration.
Then, we identified a third type of giant repeat in Amoebidium, consisting of tandem clusters of repetitive intron-poor genes up to 50 kb long, usually flanked by a Plavaka transposase (fig. S4C) (48). Many of these genes encode for tyrosine recombinases, one of the major type of transposon integrases in eukaryotes (49), and interestingly their only hits in the NCBI NR database belong to very distant eukaryotic lineages including dinoflagellates or red algae, thus suggesting some form of lateral gene transfer as their source (fig. S4C). When we combine the three types of highly methylated giant repeats, they make up 3.1% of Amoebidium’s total DNA. Their contribution to the protein-coding genes constitutes 14% of the entire proteome, with the majority originating from viruses. The amount of giant virus insertions in Amoebidium is among the largest reported in eukaryotes, at par with the moss Physcomitrium patens (Fig. 2C) (50).
Endogenized giant virus co-opted eukaryotic histone demethylases
To understand the potential contribution of endogenized genes to the Amoebidium gene repertoire, and also to better understand the gene complement of the original giant virus genomes that infect Amoebidium, we characterized the functional enrichment of genes encoded in these endogenized regions. An enrichment test of Pfam domains revealed many domain categories involved in the viral replication and integration process [recombinases, integrases, proliferating cell nuclear antigen (PCNA)], viral gene regulation (transcription factors), or some transporters [e.g., aquaporins/Major Intrinsic Protein (MIP)], which are likely critical to taking control of the host during infection (Fig. 2D) (51–53). Gene ontologies also suggested that these genes were enriched in membrane fission or tubulin depolymerization (fig. S6A). Notably, some of the most enriched categories were involved in chromatin regulation. Among these, 10 of the 18 DNMTs encoded in the Amoebidium genome reside in GEVEs, which suggests that these could be used by the virus to modify its own DNA. Consistently, giant viruses, and members of the pandoravirales in particular, are known to use various forms of DNA methylation (N6-methyladenine and N4-methylcytosines) to methylate their own genomes (54), which might play a role in infection. However, the Amoebidium GEVE DNMTs form a sister group to other giant virus uncharacterized DNMTs (fig. S1B); thus, they were not recently acquired from the host and their sequence-substrate preferences remain unknown.
The most enriched endogenized domain is the Jumonji C (JmjC) domain. Although JmjC domains can perform many enzymatic functions, our phylogenetic analysis revealed that these are divergent paralogs of the histone lysine demethylase subfamily 4 (KDM4). Notably, despite JmjC-containing proteins having been identified in giant viruses (55), we could not find any KDM4-like JmjC homologs in publicly available giant virus genomes. Amoebidium encodes a canonical KDM4 ortholog like those of other eukaryotes, including its characteristic histone-interacting domains (PHD, Tudor; Fig. 2E). However, the endogenized KDM4-like enzymes only contain the enzymatic JmjC domain (Fig. 2E). KDM4 enzymes are known to demethylate histone 3 tail lysines, most commonly lysine 9 (H3K9me2/3) or lysine 36 (H3K36me2/3) residues. Although many giant viruses encode all four eukaryotic nucleosome histones (H2A/B,H3,H4) (42, 56), we did not find any in the viral insertions. Furthermore, viral histones present very divergent histone tails (57); thus, it is unlikely that KDM4-likes are used to control potential giant virus histones. Instead, given the conserved role of H3K9me3 in heterochromatin formation in eukaryotes, KDM4-like enzymes could be used by the virus to avoid silencing by the host chromatin. In KDM4-overexpressing cancer cells, depletion of H3K9me3 promotes DNA breaks and genome instability (58), a process that could serve the virus to integrate into the host genome, or explain the amount of endogenization events.
The KDM4-like enzymes stand out among the endogenized genes as they have preserved the multi-exon domain structure of eukaryotic genes (fig. S6C), unlike the vast majority of GEVE genes that lack introns. While most of the endogenized genes remain silent in culture conditions, four of these KDM4-like genes are transcribed [Transcripts Per Million (TPM) > 1] (fig. S6C). Moreover, JmjC genes are found in 39% (52) of the insertions, which could reflect a lower chance of purging those genes after the insertion event. A couple of KDM4-like genes are found outside hypermethylated giant virus regions and are flanked by normal host genes, showing almost exclusively mCGC/GmCG methylation (the default state for transcribed host genes; fig. S6C). Given their basal position in the phylogeny of KDM4-likes (Fig. 2E), these genes could be Amoebidium-specific KDM4 divergent paralogs that already lost the chromatin interaction domains compared to the canonical KDM4 copy, and were later co-opted by the giant viruses. Alternatively, the giant virus might have originally acquired a canonical KDM4 gene from the host, which then lost some of its companion domains to perform virus-associated functions. Then, these basally branching KDM4-like genes are the remnants of past GEVE insertions, where most other viral genes have been purged and only JmjC loci are kept, being domesticated to become part of the host repertoire. Thus, the intricate interaction between the host chromatin and the giant viruses is likely critical to explain the gene flow between the host and parasite.