Molecular Medicine Israel

Genome-wide repeat landscapes in cancer and cell-free DNA

Editor’s summary

Repetitive sequences make up much of the human genome and have been implicated in cancer development. Here, Annapragada et al. have taken advantage of the new telomere-to-telomere reference genome to evaluate changes in repeat elements from liquid biopsies. The authors have developed ARTEMIS (Analysis of RepeaT EleMents in dISease) to call these repeat elements from whole-genome sequencing and created a machine learning model that is able to predict disease in patients with early-stage lung or liver cancer. This promising new technique offers a new avenue for both cancer detection and monitoring that warrants further study. —Dorothy Hallberg

Abstract

Genetic changes in repetitive sequences are a hallmark of cancer and other diseases, but characterizing these has been challenging using standard sequencing approaches. We developed a de novo kmer finding approach, called ARTEMIS (Analysis of RepeaT EleMents in dISease), to identify repeat elements from whole-genome sequencing. Using this method, we analyzed 1.2 billion kmers in 2837 tissue and plasma samples from 1975 patients, including those with lung, breast, colorectal, ovarian, liver, gastric, head and neck, bladder, cervical, thyroid, or prostate cancer. We identified tumor-specific changes in these patients in 1280 repeat element types from the LINE, SINE, LTR, transposable element, and human satellite families. These included changes to known repeats and 820 elements that were not previously known to be altered in human cancer. Repeat elements were enriched in regions of driver genes, and their representation was altered by structural changes and epigenetic states. Machine learning analyses of genome-wide repeat landscapes and fragmentation profiles in cfDNA detected patients with early-stage lung or liver cancer in cross-validated and externally validated cohorts. In addition, these repeat landscapes could be used to noninvasively identify the tissue of origin of tumors. These analyses reveal widespread changes in repeat landscapes of human cancers and provide an approach for their detection and characterization that could benefit early detection and disease monitoring of patients with cancer.

INTRODUCTION

Genomic repeats comprise more than half the human genome and include a diverse set of elements that vary widely between individuals and exert key influences on genome structure and function (1, 2). Because of technical limitations of short-read alignment and a reliance on incomplete genome assemblies, repeats have historically been neglected (3). Repetitive sequences are largely composed of tandem repeats and retrotransposons. Tandem repeats, such as human satellites, are usually concentrated in centromeres, telomeres, and the short arms of acrocentric chromosomes. Retrotransposons include diverse families of genome-wide repeats including long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), long terminal repeats (LTRs), and other transposable elements (3). The recent completion of a telomere-to-telomere (T2T) genome has added nearly 200 Mb to the previous reference genome, revealed the genomic and epigenomic states of repeats, and revitalized study of these integral genomic regions (4–6).

Changes to repeat sequences have long been implicated in the development of cancer. Transposable elements are thought to modulate gene expression, and loss of their silencing by global hypomethylation in cancer may drive their movement (7), resulting in oncogene activation and genomic instability (8). Repeat types show differential enrichment in structural breakpoints. For example, tandem repeats are enriched at breakpoints of regions of copy number variation, whereas Alu repeats are enriched in deletion and duplication breakpoints (9). The above changes to repeat elements are broadly characteristic of cancer genomes, but changes to individual element types have been observed in different cancer types (10, 11). Transposable elements serve as active enhancers for tissue-specific transcription factors dysregulated in cancers, and canonical tandem repeat expansions are associated with gene regulation and vary substantially between tumor sites of origin (12, 13). In addition, instability and expansion of repeats in pericentromeric and centromeric regions in patients with cancer may drive chromosomal missegregation and other structural changes (14–18) that have been associated with lower overall survival (19).

With the development of liquid biopsies for detection and genome-wide characterization of human cancer, analyses of repeat sequences have begun to be performed in cell-free DNA (cfDNA). Initial genome-wide analyses of cfDNA (20) did not specifically assess repeat elements. More recently, retrotransposable elements and nontelomeric satellite DNA have been shown to be highly represented in cfDNA (21) and have been used to evaluate overall cfDNA amounts or to assess aneuploidy (22–24). Despite these advances, no systematic analysis of the compendium of repeat sequences has been performed in tissue or cfDNA of any human cancer, largely because of the inability to identify and quantify repeat sequences in a genome-wide fashion.

To address these challenges, we developed ARTEMIS (Analysis of RepeaT EleMents in dISease) as an alignment-free, genome-wide approach for analyzing repeat landscapes in short read sequencing. ARTEMIS assesses more than 1 billion short kmer sequences from 1280 individual repeat types that occur genome-wide and span 57 subfamilies comprising six families (satellites, RNA elements, transposable elements, LINEs, SINEs, and LTRs). In this study, we used ARTEMIS to show that repeat landscapes were broadly altered in human cancers, including in repeat elements not previously implicated in tumorigenesis. Repeat elements were enriched in regions of genes commonly altered in cancer, and tumor-specific changes in repeats reflected a combination of structural, copy number, and focal repeat alterations in the cancer genome. Changes in repeat landscapes were detectable in cfDNA and could be used for detection and monitoring of cancer and to identify the tissue of origin of tumors.

RESULTS

De novo search of kmers in genome-wide repeat elements

To develop ARTEMIS, we conducted a de novo search of short sequences (kmers) because we hypothesized that these would have enough complexity to identify the different types of repeat elements in the genome (Fig. 1, fig. S1, and table S1). For example, a 24–base pair (bp) kmer sequence can theoretically distinguish between 281 trillion (4^24) sequences. Using the recently obtained T2T reference genome (chm13) (5, 25) assembled from long-read sequencing, we found that 4.73 billion 24-bp kmer sequences were present in the genome and 4.18 billion 24-bp kmers were unique to repeat elements overall. Because related repeat elements have diverged in their sequence composition over time, we identified 1.2 billion 24-bp kmers that uniquely defined each of 1266 recently identified repeat types. To be included in this set, a kmer could neither occur in nonrepeat regions of the genome nor occur in multiple repeat types. Each of the 1266 repeat types analyzed was defined by a median of 43,297 24-bp kmers spanning an average of 2.6 Mb of genome sequence (fig. S2). We further included 58,000 24-bp kmers from enhanced annotations of 14 human satellite subtypes (26). These 1.2 billion kmers representing 1280 repeat types were found on all chromosomes, and 98% of used kmers were only observed once in the T2T reference genome (fig. S3, A to C). These kmers also represented regions of the genome, such as human satellites, that could not be aligned with high quality in typical short-read next-generation sequencing. This allowed ARTEMIS to consider the entirety of the genome, rather than only that from the ~60 to 85% of reads from next-generation sequencing that can be aligned with high quality (27, 28).
To verify that these repeat landscape kmers would not be confounded by human-associated microbial genomes (29, 30), we examined 1545 reference genomes representing common microbes and found a median of 100 ARTEMIS kmers per microbial genome (range, 0 to 1350), in all cases comprising <0.0002% of the 1.2 billion possible kmers counted in ARTEMIS. For analyses of an individual sample, we defined the kmer repeat landscape as the count of all kmers in a sequenced sample that matched each of the 1280 repeat types divided by the number of aligned sequence reads. Because changes in repeat sequences may occur during initiation of cancer and other diseases, this comprehensive compendium of repeat features can be used to distinguish genomes from normal and disease states.
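The landscape definition above can be illustrated with a minimal Python sketch. It is not the ARTEMIS implementation: the function names, the toy 4-bp kmers, and the repeat-type labels are hypothetical, and the read count passed in simply stands in for the number of aligned reads. Counting a kmer together with its reverse complement follows the description in Materials and Methods.

```python
from collections import Counter

# Translation table for complementing A/C/G/T bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def repeat_landscape(reads, kmer_to_type, n_aligned_reads, k=24):
    """Count type-specific kmers seen in the reads, per repeat type,
    and divide each count by the number of aligned reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            repeat_type = kmer_to_type.get(canonical(read[i:i + k]))
            if repeat_type is not None:
                counts[repeat_type] += 1
    return {t: c / n_aligned_reads for t, c in counts.items()}


# A 24-mer over a 4-letter alphabet has 4 ** 24 possible sequences,
# which is 281,474,976,710,656 (about 281 trillion).
n_possible_24mers = 4 ** 24

# Toy example with k = 4 (hypothetical kmers and labels).
toy_kmers = {canonical("ACGG"): "L1_toy", canonical("TTTA"): "Alu_toy"}
toy_reads = ["ACGGA", "CCGTT", "TAAAC"]
landscape = repeat_landscape(toy_reads, toy_kmers, n_aligned_reads=3, k=4)
```

In the toy input, the second read carries the reverse complement of ACGG, so it is counted toward the same repeat type as the first read.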

Genome-wide enrichment of repeat kmers in cancer-related genes and pathways

We first examined the genome-wide distribution of the 1.2 billion kmers defining unique repeat types and found that repeat elements were enriched in genes commonly altered in human cancer (table S2). Of the 736 genes in the COSMIC cancer driver gene census (31), we found that 487 of these had a higher than expected number of repeat kmer sequences within their exonic or intronic sequences (normalized enrichment score = 9.12 and false discovery rate q = 0.00; fig. S3D), including those in genes amplified, deleted, and rearranged in cancer (normalized enrichment scores = 1.86, 4.04, and 6.71 and false discovery rate q = 0.01, 0.00, and 0.00, respectively; table S3). This enrichment remained significant even after correcting for the size of these genes (fig. S3D and table S3) and reflected an average 15-fold increase in repeat kmers in these regions (P < 2.2 × 10^−16, Wilcoxon signed-rank test). In contrast, an analysis of the same number of randomly chosen genes in the genome did not show an enrichment of repeat kmer sequences (normalized enrichment score = −1.05 and false discovery rate q = 0.77). Repeat kmer sequences were also significantly increased in pathways commonly dysregulated in cancer including in cell adhesion, growth, and signaling, as well as cancer type–specific gene sets (false discovery rate q < 0.05; fig. S4). Together, these observations of repeat kmer localization suggest that alterations in key genes affecting oncogenic pathways in human cancer may be selected for during tumorigenesis using repeat-related genomic changes.

Kmer repeat landscapes are altered in cancer genomes

Given the broad number of genomic changes that occur during tumorigenesis, we evaluated whether kmer repeat landscapes were altered in cancers using short-read next-generation sequencing technologies. Given the challenges of distinguishing highly related repeat sequences, we simulated short-read whole-genome sequence data incorporating typical sequence error rates and analyzed these in an alignment-free fashion. We found that despite potential sequencing errors, the high complexity of kmer sequences allowed them to remain specific for their defined repeat family (98% of kmers counted were found in reads originating from their true repeat type) (figs. S5 and S6).

We analyzed matched tumor and normal tissues of 525 patients (table S4) from the Pan-Cancer Analysis of Whole Genomes (PCAWG) (32), including those with breast (n = 91), lung (n = 86), colorectal (n = 60), liver (n = 54), thyroid (n = 48), head and neck squamous cell (n = 44), ovarian (n = 42), gastric (n = 38), bladder (n = 23), cervical (n = 20), or prostate (n = 19) cancer and determined whether genome-wide kmer counts for specific repeat element types were altered in the tumors. An average of 22.4 billion total kmers were identified in each sample sequenced at 30 to 60× coverage, representing 1280 repeat elements. A median of 807 repeat elements (range, 246 to 1280) had increased or decreased kmer counts in tumors compared with their matched normal tissues (Fig. 2A and tables S5 and S6). Nearly two-thirds of altered elements (820 of 1280) had not been previously observed as being altered in human cancer (Fig. 2A and Table 1). Elements from satellites, LINEs, and SINEs were altered at the highest rates, although changes were also frequently observed in elements within LTRs, transposable elements, and RNA elements. Nearly a quarter of the elements studied came from the largest repeat subfamily of LTRs, ERV1s (endogenous retrovirus 1) (Table 1), which are hypothesized to aberrantly activate transcription in cancer cells by onco-exaptation, the process by which reactivated transposable elements can drive oncogene expression (33). On average, more than 40% of the 300 LTR ERV1 elements were altered in all 12 types of tumors studied, although the individual altered elements varied across tissue types. Although changes to 21 ERV1s have been described previously (34), we observed changes in an additional 279 ERV1s across the cancer types analyzed (tables S7 and S8). Similar to other large-scale changes in cancer genomes (35, 36), changes in kmer repeat landscapes were highly complex, with no two patients studied having the same set of alterations.

We hypothesized that changes in kmer repeat landscapes would in part be related to structural changes that arise during tumorigenesis such as chromosomal copy number changes, rearrangements, or focal amplifications or deletions. Accordingly, we found that kmer counts reflected chromosomal arm gains or losses genome-wide in a representative set of analyzed tumors (r = 0.81, P < 2.2 × 10^−16, Spearman’s correlation) (fig. S7). In addition, tumors with more changes in kmer repeat landscapes had higher chromosomal instability as reflected through overall genomic entropy, loss of heterozygosity, nonmodal ploidy fraction of the genome, and other measures of genome-wide structural changes (r = 0.34, P = 2.1 × 10^−14; r = 0.29, P = 2.7 × 10^−10; and r = 0.33, P = 1.2 × 10^−14, respectively, Spearman’s correlation) (Fig. 2A). In contrast, tumor mutation burden, a measure of single-base sequence changes in an individual cancer, was only weakly correlated with genome-wide kmer repeat landscape changes (r = 0.15, P = 8.3 × 10^−4, Spearman’s correlation).
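For readers unfamiliar with the statistic used throughout this section, the following Python sketch computes Spearman's rank correlation as the Pearson correlation of average ranks. The two toy vectors (arm-level copy ratios and kmer count log ratios) are invented for illustration and are not data from the study; the function names are likewise hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Toy data (hypothetical): copy ratio per chromosome arm and the
# corresponding change in kmer counts for repeats on that arm.
arm_copy_ratio = [-0.5, 0.0, 0.3, 0.8]
kmer_log_ratio = [-0.4, 0.1, 0.05, 0.9]
rho = spearman(arm_copy_ratio, kmer_log_ratio)
```

Because the statistic depends only on ranks, it is insensitive to the scale on which kmer counts or copy number are expressed.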

Rearrangements, resulting from copy neutral translocations, as well as inversions, duplications, or deletions, may be facilitated by crossing over of homologous sequence (37). An analysis of the locations of repeat elements and tumor-specific sequence breakpoints in the 525 samples analyzed identified an enrichment of 215 elements at breakpoint locations, comprising LINEs, SINEs, LTRs, transposable elements, and RNA elements and including 128 elements that had never been previously implicated as being altered in cancer, suggesting that these elements may play a role in facilitating these structural changes (Fig. 2B). Analysis of focal amplifications of five or more copies in a subset of analyzed tumors revealed that repeat element content across all subfamilies correlated with an increase in amplicon copy number (r = 0.91, P < 2.2 × 10^−16, Spearman’s correlation). As an example, analysis of the 1-Mb region surrounding ERBB2 in breast tumors with known gains at this region revealed significant increases in 14 repeat elements, including in eight elements with no previously documented changes in cancer (P < 0.05; Fig. 2C and fig. S8A). Similarly, gains of the ~30-Mb region on chromosome 3q containing driver genes PIK3CA and SOX2 in squamous cell lung cancer (38, 39) revealed increases in kmers for repeat elements overlapping these regions, including in nine elements not previously known to be altered in cancer (fig. S8, B and C).

Changes in the content of repeat landscapes were not fully explained by chromosomal or focal copy number changes or genomic rearrangements. After comparing changes in repeat elements to the segmented copy number alterations observed across the cancer genomes analyzed, we determined that 89% of the repeat changes (median, 693 changes per tumor; range, 232 to 1280) were larger in magnitude than would be expected because of copy number gains and losses alone (fig. S9 and table S9). A set of 236 elements exhibited changes not explained by copy number changes in at least 75% of tumors studied. These types of changes included reduction of kmer elements through LINE-1–mediated deletions in squamous cell lung cancers and lower-than-expected repeat content in regions of copy number gain, consistent with the concept that these repeat sequences may undergo deletion because they facilitate gains in nearby genomic content (Fig. 2D and figs. S7 and S10, A and B) (37, 40). Overall, these analyses highlight the ability of kmer repeat landscapes to detect and characterize a broad variety of structural changes in human cancer, including large chromosomal changes, commonly amplified or deleted driver gene regions, and alterations that directly target repeat sequences.

We next used a machine learning model to generate for each sample an ARTEMIS score, a single number that provides a quantitative summary of genome-wide repeat element changes predictive of disease state. Despite germline variability of repeat elements among different individuals (fig. S11), cross-validated ARTEMIS scores distinguished the 525 PCAWG tumors from normal tissues with high performance across all cancer types analyzed, regardless of the race of patients [overall area under the curve (AUC) = 0.96] (fig. S12).

To evaluate the potential clinical implications of changes in repeat elements of cancer genomes, we examined whether ARTEMIS scores for each tumor were associated with changes in overall survival or progression-free survival for patients with advanced cancers (stage III or IV, n = 167) in the PCAWG dataset. We found that increased ARTEMIS scores were associated with longer overall (P < 0.001) and progression-free (P < 0.001) survival (Fig. 2E) and remained significant for progression-free survival even after adjusting for tumor type (P < 0.001; fig. S13). This change in patient outcomes was not observed for other nonrepeat genome-wide metrics, including genomic entropy, loss of heterozygosity, or nonmodal ploidy fraction (fig. S14, A and B). Given that the ARTEMIS score captures genome-wide changes to repeat landscapes, our observations are consistent with previous analyses indicating that reactivation and increase in repeat elements in cancer genomes may lead to increased immune responses (41–43) or genomic instability (44), both mechanisms that could reduce tumor cell fitness and lead to improved patient outcomes.

Kmer repeat landscapes in cfDNA

We sought to determine whether our approach to characterizing the repeat landscape could be used for evaluating circulating cfDNA. Detection of repeats using low-coverage whole-genome sequencing would theoretically be achievable because ARTEMIS aggregates a large number of kmer-defined repeat element instances throughout the genome while maintaining sufficient granularity to identify disease-specific genomic features. As a first step in this analysis, we determined that repeat landscapes in PCAWG were highly consistent even if these were subsampled to different sequencing depths ranging from >60× to 1× coverage (figs. S15 and S16). We further found that kmer repeat landscapes in cfDNA were consistent across different sequencing platforms and experimental batches (fig. S17).

To determine whether repeat landscapes could be quantified in the plasma using low-coverage sequencing of cfDNA, we first examined satellite families with known distributions on the Y chromosome (chrY) in a collection of male and female individuals (n = 158). In the plasma of males (n = 87), kmer counts for human satellite types known to be found exclusively or predominantly on chrY were substantially higher than in females (n = 71) (P < 2.2 × 10^−16 for all types) (Fig. 3A), whereas satellites not found on chrY showed no significant difference between males and females (P > 0.1 for all types).

We then identified repeat elements with the largest changes across the PCAWG tumors and evaluated the occurrences of these elements in cfDNA of individuals in prospectively collected diagnostic cohorts for patients at risk for lung or liver cancer (n = 287 for lung cancer cohort, table S10; n = 208 for liver cancer cohort, table S11), which had been previously sequenced (27, 45). Across cohorts, many of the repeat kmer increases or decreases observed in tumors were evident in the plasma of patients with lung cancer of squamous or adenocarcinoma subtypes or liver cancer as compared to plasma from individuals without cancer (Fig. 3B). These included changes not only in elements with previously documented roles in cancer such as LINE L1 elements but also newly identified elements that have now been revealed to have alterations in cancer, including from subfamilies such as DNA-hAT-Charlie and LTR ERV1, ERVL-MaLR, and ERVL.

We hypothesized that repeat landscapes in cfDNA could be different from the expected repeat content in genomic DNA due to genome-wide chromatin and epigenetic changes that may alter the representation of cfDNA fragments in the circulation (27, 28, 45–49). We have previously shown that cfDNA fragmentation profiles reflect open and closed chromatin states genome-wide (28, 45). Here, we analyzed cfDNA from 158 individuals without cancer and showed that regions with different histone marks had differential density of repeat element types (Fig. 4, A and B) and that individual cfDNA fragments derived from regions with actively transcribed chromatin or activated histone marks had shorter lengths and exhibited lower coverage in the plasma (Fig. 4, C and D). Overall, repeat landscape kmer counts in cfDNA for regions with high density of activating chromatin histone marks were lower than for regions with low density of these marks, whereas the reverse was observed for repressive histone marks (Fig. 4E). Genome-wide simulations suggest that repeat landscapes in cfDNA may be influenced by both tumor-specific epigenomic and genomic changes (fig. S18).

ARTEMIS kmer repeat landscape analyses for cancer detection and monitoring

Given the ability to identify repeat landscape changes in cfDNA, we evaluated the potential of the ARTEMIS method for noninvasive detection of cancer (fig. S19). We previously described use of a sensitive and accessible whole-genome cfDNA fragmentation test (DELFI, DNA evaluation of fragments for early interception) for lung and liver cancer screening in high-risk populations (27, 45). Here, we used the kmer repeat landscapes and epigenetic profiles in cfDNA regions with high density of histone marks that differentially affect repeat representation (fig. S20 and table S12) as features in machine learning models to detect lung cancer in the Danish Lung Cancer Screening Study (LUCAS) prospectively collected diagnostic cohort (n = 287) and liver cancer in a high-risk population (n = 208) (27, 45). ARTEMIS classified patients with lung cancer with an AUC of 0.82 [95% confidence interval (CI), 0.78 to 0.87], and when ensembled with the DELFI genome-wide fragmentation features (28), a joint ARTEMIS-DELFI model classified patients with lung cancer with an AUC of 0.91 (95% CI, 0.88 to 0.94) (Fig. 5, A and B, and fig. S21). Similar performance was observed in the cohort of individuals at risk for liver cancer, where ARTEMIS detected individuals with liver cancer among patients with cirrhosis or viral hepatitis with an AUC of 0.87 (95% CI, 0.82 to 0.93), and when combined with DELFI, the AUC improved to 0.90 (95% CI, 0.86 to 0.94) (fig. S22). We validated the locked ARTEMIS and ARTEMIS-DELFI models in an external cohort (table S13) composed of noncancer individuals at high and average risk of lung cancer (n = 400) and patients with all stages of lung cancer (n = 88) and observed similar performance to that in the cross-validated training cohort (Fig. 5C and fig. S23).
Analysis of a separate held-out set of patients from the LUCAS cohort with a prior history of cancer (n = 25; table S14) using the locked ARTEMIS and ARTEMIS-DELFI models revealed higher scores in patients who experienced cancer recurrence compared with those who did not (fig. S24). We further applied these models to an independent cohort of patients with late-stage lung cancer (n = 19; table S15) receiving tyrosine kinase inhibitor therapy (50) and demonstrated that the ARTEMIS and joint ARTEMIS-DELFI scores were correlated to circulating tumor DNA mutant allele fractions observed during therapy (r = 0.70, P = 2.67 × 10^−12 for ARTEMIS and r = 0.80, P < 2.2 × 10^−16 for ARTEMIS-DELFI, Spearman’s correlation). Analysis of ARTEMIS-DELFI scores in patients at the first time point after initiation of treatment (median, 6 days) identified that those with scores above or below the pretreatment median had shorter or longer progression-free survival, respectively (median, 1.4 months for patients in the high-score group versus 8.9 months for low-score group; P < 0.001, log-rank test, two-sided) (fig. S25).

Lastly, given the observation of tumor-specific changes in repeat landscapes, we evaluated whether ARTEMIS could aid tissue of origin determination in tumor or cfDNA samples of patients with cancer. We first examined whether kmer repeat landscapes could capture a tissue-specific signal. We trained a machine learning model on the PCAWG cohort using kmer repeat landscapes to differentiate between tissue types and found that it classified the tumors by tissue of origin with an average of 78% accuracy among 12 tumor types studied despite relying only on genomic features, which are typically thought to show fewer tissue-specific differences than transcriptomic and epigenetic features (table S16). This is consistent with the observations that although changes in repeat landscapes are a pan-cancer feature of cancer genomes, the specific repeat elements altered vary between tumor types (Fig. 2A and tables S7 to S9). We then extended this approach to cfDNA, cross-validating ARTEMIS-DELFI within a multicancer cohort including 226 individuals with breast, ovarian, lung, colorectal, bile duct, gastric, and pancreatic tumors (table S17) (28). Despite the small number of samples available for training, we found that ARTEMIS-DELFI correctly categorized detected patients among the different cancer types with an average of 68 or 83% accuracy, for the highest or top two predictions, respectively (Table 2 and table S18).
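The "highest or top two predictions" metric reported above can be made concrete with a short Python sketch. The function name, the class labels, and the probabilities are hypothetical and unrelated to the study data; the sketch only shows how top-1 and top-2 accuracy are computed from per-class scores.

```python
def top_k_accuracy(true_labels, class_scores, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    class_scores is a list of dicts mapping class label -> predicted score.
    """
    hits = 0
    for truth, scores in zip(true_labels, class_scores):
        # Class labels ordered from highest to lowest predicted score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy predictions for three samples (hypothetical values).
truth = ["lung", "breast", "ovarian"]
scores = [
    {"lung": 0.6, "breast": 0.3, "ovarian": 0.1},
    {"lung": 0.5, "breast": 0.4, "ovarian": 0.1},
    {"lung": 0.5, "breast": 0.3, "ovarian": 0.2},
]
top1 = top_k_accuracy(truth, scores, k=1)
top2 = top_k_accuracy(truth, scores, k=2)
```

In the toy data the second sample is misclassified by its highest prediction but recovered when the top two predictions are allowed, which is why top-two accuracy is never lower than top-one accuracy.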

DISCUSSION

In this study, we show that ARTEMIS can reconstruct genome-wide repeat landscapes that reflect underlying changes in human cancer. The alterations reflect structural changes in the cancer genome and direct changes in repeat elements. Through these analyses, we found that repeat elements were enriched in the genome in genes commonly altered in human cancer, including at specific tumor-derived rearrangement breakpoints. Cancer-specific changes of the repeat landscape were observed genome-wide, including in elements not previously known to be altered in human cancer. These elements may provide an underlying basis for structural alterations and the genomic instability of genes, pathways, and chromosomes widely altered in human cancers. In addition, the expansion or contraction of repeat elements that can now be comprehensively identified provides a new way to detect and examine mechanisms affecting cancer and other diseases.

We found that changes in repeat landscapes were detectable in the circulation and that the signal in plasma was further altered by epigenetic changes to repeat elements that influence their susceptibility to fragmentation. We and others have previously shown that changes in chromatin accessibility, transcription factor binding, and methylation can alter the representation of cfDNA in the blood (28, 45–48). In this study, we show that epigenetic states affected by histone acetylation and methylation, leading to altered gene expression, have a profound impact on the size and coverage of cfDNA at distinct regions genome-wide, including in repeat regions. These analyses suggest that kmer repeat landscapes in plasma can reveal both structural and epigenetic changes in the genome.

Our study has some limitations. Although we externally validated ARTEMIS for cancer detection in three cohorts consisting of 532 patients with and without lung cancer, it will be critical to validate this approach in larger screening populations and for other applications. Another limitation of ARTEMIS is that it relies on evaluation of changes in repeat landscapes that are inherently variable among the germlines of individuals (3, 14, 51–53). In addition, certain repeat regions of the genome, including low-complexity repeats and highly polymorphic regions, would not be fully analyzed through this approach. Although ARTEMIS reveals genome-wide changes in repeat landscapes, the specific location of changes in repeat elements may not be directly identified through this approach. In the future, it will be valuable to characterize kmer repeat landscapes across diverse individuals because the current chm13 reference genome is from a single individual, and comparisons to a representative panel of healthy genotypes of different germline backgrounds could improve performance. Moreover, the functional impacts of changes in repeat families remain poorly understood and could be improved through further analyses in cancer and other disease states.

Repeat landscape analyses for cfDNA-based detection of lung, liver, and other cancers suggest that ARTEMIS alone or in combination with other genome-wide features may provide an avenue for noninvasive detection, monitoring, and tissue of origin determination of cancer. ARTEMIS may improve early-stage diagnosis by identifying genome-wide changes that would perhaps not be evident in other liquid biopsy approaches when tumor features such as mutations or chromosomal arm changes are not detected. Only 44% of the genome-wide occurrences of kmers used in the ARTEMIS method are within known genes, and many of the repeat types in our landscapes have not been studied in human cancers. Given the size, diversity, and potential clinical relevance of these regions of the genome, our study offers unique insights into the cancer genome and provides a proof of concept for the utility of genome-wide kmer repeat landscapes as tissue- and blood-based biomarkers for cancer detection, characterization, and monitoring.

MATERIALS AND METHODS

Study design

Our objective was to characterize repeat landscapes in whole-genome sequencing of tissue and cfDNA from individuals with and without cancer. We used these to characterize cancer-related changes in repeat landscapes and to develop liquid biopsy approaches for cancer detection, monitoring, and tissue of origin classification. We obtained matched tumor and normal BAM files for the 539 patients with lung, liver, ovarian, colorectal, breast, thyroid, head and neck squamous, prostate, cervical, bladder, and gastric cancers in PCAWG that were available in the Protected Data Cloud (32) and excluded 14 patients for which either the tumor or normal was on the PCAWG blacklist (table S4).

We analyzed whole-genome sequencing (1 to 2× coverage) data from cfDNA from 819 individuals with and without lung cancer from four cohorts; 208 individuals with and without liver cancer; and 423 individuals from a multicancer cohort of patients without cancer and with breast, ovarian, lung, bile duct, colorectal, gastric, duodenal, and pancreatic tumors described in our previous publications (27, 28, 45, 50) (fig. S19; tables S10, S11, S13, S14, S15, and S17; and Supplementary Materials and Methods).

Collection of patient samples used in this study conformed to all relevant ethical regulations. All patients provided written informed consent, and the studies were performed according to the Declaration of Helsinki. Because all samples analyzed were from previously published cohorts, no study size calculations, randomization, or blinding was performed in the present study.

De novo kmer finding

We first extracted all repeat sequences and coordinates for known repeat element types from the RepeatMasker track in chm13 (T2T-CHM13v2.0). We excluded repeats from the families low-complexity, unknown, and simple repeats, leaving 1287 types of repeats across 57 subfamilies comprising 13 families. For simplicity, we aggregated all elements in the gene and pseudogene families transfer RNA (tRNA), signal recognition particle RNA (srpRNA), small nuclear RNA (snRNA), small cytoplasmic RNA (scRNA), and ribosomal RNA (rRNA) as RNA elements and the families DNA, DNA?, retroposon, and RC (Rolling Circle) as transposable elements, leaving six overall families (LINE, SINE, LTR, satellites, transposable elements, and RNA elements).

We then performed a de novo kmer finding procedure inspired by Altemose et al. (26) using Jellyfish (54) to count all unique 24-mers occurring in each of the 1287 types of repeats, as well as those occurring in the portions of the genome excluding all repeat regions. We then selected all kmers that occurred only in a single repeat type and that were not present in the nonrepeat regions of the genome. The kmers for a sequence and its reverse complement were counted together because the reference genome represents one strand, whereas we expect half of the paired-end reads to be derived from the reverse complement strand. We identified at least one unique kmer in 1266 of the 1287 repeat types. We additionally included 58,426 kmers from 14 HSATII (Human Satellite II) and HSATIII (Human Satellite III) subfamilies (26) to supplement the RepeatMasker Satellite annotations. These kmers overlapped with broader satellite types in the RepeatMasker track, but we allowed these kmers to be counted in multiple repeat types for consistency with the previous publication (fig. S6). In total, we identified 1,206,871,310 distinct kmers defining 1280 repeat element types (see Supplementary Materials and Methods). To verify that these kmers had low co-occurrence in common human-associated microbial genomes, we counted kmers in 1545 microbial genomes from the Human Microbiome Project available for download on National Center for Biotechnology Information (NCBI) Entrez (29, 30). We analyzed the colocalization of these kmers within cancer driver genes using a gene set enrichment analysis and verified that these kmers could be identified in simulations of short-read sequencing incorporating a realistic error rate (see Supplementary Materials and Methods).
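The selection rule described above (keep kmers that occur in exactly one repeat type and never in nonrepeat sequence, counting each kmer together with its reverse complement) can be illustrated with a minimal Python sketch. This is not the Jellyfish-based pipeline used in the study: the function names (canonical, kmers, type_specific_kmers), the toy sequences, and the short k used in the example are assumptions made purely for illustration, and the special handling of HSATII and HSATIII kmers is omitted.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmers(seq, k):
    """Yield canonical kmers of length k, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield canonical(window)

def type_specific_kmers(repeat_seqs, nonrepeat_seqs, k=24):
    """Keep kmers seen in exactly one repeat type and never in nonrepeat sequence.

    repeat_seqs: dict mapping a repeat type name to a list of sequences (toy input).
    nonrepeat_seqs: list of sequences from outside annotated repeats.
    """
    background = {km for s in nonrepeat_seqs for km in kmers(s, k)}
    seen_in = defaultdict(set)
    for repeat_type, seqs in repeat_seqs.items():
        for s in seqs:
            for km in kmers(s, k):
                seen_in[km].add(repeat_type)
    result = defaultdict(set)
    for km, types in seen_in.items():
        if len(types) == 1 and km not in background:
            result[next(iter(types))].add(km)
    return dict(result)

# Toy example with k=4 instead of 24 so the result can be checked by hand.
toy = type_specific_kmers(
    {"typeA": ["AAAACCCC"], "typeB": ["CCCCGGTT"]},
    ["TTTTAAAA"],
    k=4,
)
```

In this toy input, kmers shared between the two repeat types (for example CCCC) and kmers also present in the nonrepeat sequence (AAAA) are discarded, which mirrors the two exclusion criteria in the text.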

Generation of kmer repeat landscapes

We obtained all sequencing reads for each sample, counted each unique kmer and its reverse complement, and aggregated the kmer counts for each repeat type. We normalized the aggregated counts to the number of reads that were aligned with Mapping Quality (MAPQ) ≥ 30 to the hg19 genome (samtools view -c -q 30 -F 3844). Our approach considered all reads, including those from portions of the genome not provided in hg19 and/or repeat types that were not aligned.
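As a rough illustration of the counting and normalization step, the sketch below tallies kmer hits per repeat type across reads and scales them to kmers per million aligned reads. The function name repeat_landscape, the toy kmer-to-type table, and the per-million scaling factor are assumptions for illustration; in the study the denominator comes from the samtools command quoted above, whose -F 3844 filter corresponds to the SAM flag bits listed in the code.

```python
from collections import Counter

# SAM flag bits excluded by `-F 3844` (bit values from the SAM specification).
EXCLUDED_FLAGS = {
    4: "read unmapped",
    256: "secondary alignment",
    512: "failed quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}
assert sum(EXCLUDED_FLAGS) == 3844  # sums the dictionary keys

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])

def repeat_landscape(reads, kmer_to_type, n_aligned_reads, k=24):
    """Aggregate kmer hits per repeat type, scaled per million aligned reads.

    reads: iterable of read sequences (all reads, aligned or not).
    kmer_to_type: dict mapping canonical kmers to a repeat type (toy input).
    n_aligned_reads: count of reads passing the alignment filter, used only
        as the normalization denominator.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            repeat_type = kmer_to_type.get(canonical(read[i:i + k]))
            if repeat_type is not None:
                counts[repeat_type] += 1
    scale = 1_000_000 / n_aligned_reads
    return {repeat_type: n * scale for repeat_type, n in counts.items()}

# Toy example with k=4: two reads, two type-specific kmers.
landscape = repeat_landscape(
    ["AAACGTTT", "CGGTAAAC"],
    {"AAAC": "typeA", "ACCG": "typeB"},
    n_aligned_reads=2,
    k=4,
)
```

Note that the reads being counted and the reads used for normalization are deliberately different sets here: every read contributes kmers, while only well-aligned reads define the denominator.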

We analyzed these kmer repeat landscapes in PCAWG tissue samples, identified alterations in the repeat landscape indicative of structural changes, and determined that many of these changes were in elements not previously implicated in oncogenesis (see Supplementary Materials and Methods). We then analyzed kmer repeat landscapes in cfDNA samples from patients with and without cancer and analyzed structural and epigenetic influences on repeat element representation in the circulation (see Supplementary Materials and Methods).

ARTEMIS machine learning models in tissue

We centered and scaled the coverage normalized counts of the kmer repeat landscape for each tumor and normal tissue sample and trained a penalized logistic regression (PLR) model to generate a cross-validated ARTEMIS score (for each sample, the ARTEMIS score was calculated as the mean across 10 repeats of fivefold cross-validation) for distinguishing tumor from normal tissue samples. We further used kmer repeat landscapes to train a multiclass gradient-boosted model (GBM) to generate cross-validated (fivefold cross-validation) predictions of tumor tissues of origin (for each sample, the model generated a vector of multinomial probabilities, where each element corresponded to a possible tumor tissue of origin and the predicted class was chosen on the basis of the element with maximum value).
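The scoring scheme for tissue samples (each sample's score is the mean of its held-out predictions across 10 repeats of fivefold cross-validation) can be sketched as follows. This is a structural illustration only: the function names are hypothetical, and centroid_scorer is a toy stand-in for the penalized logistic regression, chosen so that the example runs with the Python standard library alone.

```python
import random
from statistics import mean

def repeated_kfold_scores(X, y, fit_predict, n_repeats=10, n_folds=5, seed=0):
    """Average each sample's held-out score across repeated k-fold cross-validation.

    fit_predict(train_X, train_y, test_X) must return one score per test row.
    """
    rng = random.Random(seed)
    n = len(X)
    per_sample = [[] for _ in range(n)]
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)
        # Strided slices partition the shuffled indices into n_folds folds.
        folds = [order[f::n_folds] for f in range(n_folds)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            scores = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            for i, s in zip(test_idx, scores):
                per_sample[i].append(s)
    return [mean(s) for s in per_sample]

def centroid_scorer(train_X, train_y, test_X):
    """Toy stand-in model (NOT a PLR): distance to the class-0 centroid
    minus distance to the class-1 centroid, so higher means more tumor-like."""
    def centroid(label):
        rows = [x for x, lab in zip(train_X, train_y) if lab == label]
        return [mean(col) for col in zip(*rows)]
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    c0, c1 = centroid(0), centroid(1)
    return [dist(x, c0) - dist(x, c1) for x in test_X]

# Toy data: 10 "normal" samples near 0 and 10 "tumor" samples near 5.
X = [[0.1 * i] for i in range(10)] + [[5.0 + 0.1 * i] for i in range(10)]
y = [0] * 10 + [1] * 10
scores = repeated_kfold_scores(X, y, centroid_scorer)
```

Because every sample is held out exactly once per repeat, each final score is an average of 10 predictions made by models that never saw that sample during training.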

ARTEMIS for early detection, tissue of origin, and monitoring of cancer in cfDNA

We obtained a kmer repeat landscape for each sample using the 786 features with more than 1000 kmers per million aligned reads expected. This filtering was used because, at low coverage, features with low abundance have greater technical variation (fig. S17). To accommodate ensembling of diverse feature classes, we used nested cross-validation to generate the ARTEMIS score. The inner cross-validation loop trained six PLR models (Lasso regression, α = 1, penalty chosen in the range of 0.00001 to 0.1 by resampling within each cross-validation fold) with repeat landscapes as features (a PLR for each of five repeat families and a PLR for the epigenetic profile; see Supplementary Materials and Methods). The outer cross-validation loop was trained with a leave-one-individual-out architecture; we ensembled the six scores available for each of the N − 1 individuals using a PLR model. The score obtained by applying this PLR model to the six scores for the held-out patient’s features is the ARTEMIS score for cancer detection in cfDNA.
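The abundance filter can be sketched in a few lines of Python. How the expected abundance per feature was estimated is not stated in this section, so the sketch assumes it is the mean across a set of samples; that choice, the function name abundant_features, and the feature names in the toy data are all illustrative assumptions.

```python
from statistics import mean

def abundant_features(landscapes, min_per_million=1000.0):
    """Return feature names whose mean normalized abundance exceeds the cutoff.

    landscapes: list of dicts mapping a repeat type to kmers per million
        aligned reads. Features missing from a sample count as 0.
    Using the mean across samples as the "expected" abundance is an assumption.
    """
    names = set().union(*landscapes)
    return sorted(
        name for name in names
        if mean(sample.get(name, 0.0) for sample in landscapes) > min_per_million
    )

# Toy landscapes for two samples (feature names and values are made up).
samples = [
    {"L1HS": 5000.0, "AluY": 3000.0, "rareA": 200.0},
    {"L1HS": 7000.0, "AluY": 2500.0, "rareA": 1500.0, "rareB": 1200.0},
]
kept = abundant_features(samples)
```

Here rareA (mean 850) and rareB (mean 600) fall below the cutoff and are dropped, which reflects the stated motivation that low-abundance features are noisier at 1 to 2× coverage.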

To incorporate DELFI fragmentation profiles, we ensembled the ARTEMIS score with three additional models: a PLR model using principal components analysis of the ratio of short to long fragments in 5-Mb bins genome-wide (27), a PLR model on 39 chromosomal arm z-scores for aneuploidy (27), and a GBM on coverage in 5-Mb bins genome-wide (28). This combined ensemble produced a joint ARTEMIS-DELFI score. We retrained the lung cancer model on the full LUCAS cohort and then applied the locked models to four external validation cohorts: the Johns Hopkins University validation set (n = 431) from Mathios et al. (27); a subset of patients with prior cancers, with and without cancer recurrence in the LUCAS cohort (n = 25); the validation set from the AHN/DECAMP cohort in Bruhm et al. (55); and the lung cancer monitoring cohort from Phallen et al. (n = 19) (50).

Last, we trained multiclass GBMs using the features described above to generate an ARTEMIS and ARTEMIS-DELFI score for tissue of origin classification. The final ARTEMIS model for tissue of origin used the ensemble components described above as features, and the ARTEMIS-DELFI model used these components and additional fragmentation features (the 39 chromosomal arm z-scores for aneuploidy and the ratio of short to long fragments in 5-Mb bins genome-wide). Each ensemble component produced a vector of multinomial probabilities, with one for each possible tumor site. We determined classification on the basis of the element of the vector with the maximum value. When ensembling multiple GBM classifiers, all elements of the vector were used as feature inputs to the ensemble model. The models were trained using the nested cross-validation procedure described above on all cancer samples from Cristiano et al. (28) (n = 423), and performance was reported for all cancers detected at the 90% specificity threshold by the ARTEMIS and ARTEMIS-DELFI detection models when trained on the full cohort including patients without cancer. Consistent with that previous publication, for the tissue of origin analyses, we included the baseline time points from the lung cancer monitoring cohort above to increase the number of lung cancers available for classification analyses.
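Two decision rules in this paragraph lend themselves to a short worked example: picking the tissue with the maximum multinomial probability, and restricting the evaluation to cancers detected at a 90% specificity threshold. The sketch below assumes one simple convention for the threshold (the smallest observed non-cancer score that leaves at least 90% of non-cancer scores at or below it, with detection defined as a score strictly above the cutoff); the exact convention used in the study is not given here, and all names and numbers are hypothetical.

```python
import math

def predicted_tissue(probabilities):
    """Return the class with the highest multinomial probability."""
    return max(probabilities, key=probabilities.get)

def threshold_at_specificity(noncancer_scores, specificity=0.90):
    """Cutoff such that at least `specificity` of non-cancer scores are <= cutoff.

    A sample is called positive when its score is strictly greater than the
    cutoff. This quantile convention is an assumption for illustration.
    """
    ordered = sorted(noncancer_scores)
    # round() guards against floating-point noise before taking the ceiling.
    n_below = math.ceil(round(specificity * len(ordered), 9))
    return ordered[max(n_below - 1, 0)]

# Toy detection scores for 10 individuals without cancer.
noncancer = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
cutoff = threshold_at_specificity(noncancer)

# Toy cancer samples: a detection score plus tissue-of-origin probabilities.
cancers = [
    {"score": 0.97, "probs": {"lung": 0.7, "liver": 0.2, "breast": 0.1}},
    {"score": 0.40, "probs": {"lung": 0.3, "liver": 0.6, "breast": 0.1}},
]

# Tissue of origin is reported only for cancers detected above the cutoff.
calls = [predicted_tissue(c["probs"]) for c in cancers if c["score"] > cutoff]
```

In the toy data only the first cancer sample exceeds the cutoff, so only that sample receives a tissue of origin call.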

Statistical analysis

Computer code, software versions, processed data in tabular format used for making figures in the manuscript, and the computing environment for running the ARTEMIS pipeline and generating figures in this study are available at https://github.com/cancer-genomics/artemis2024. This GitHub repository has also been archived on Zenodo at 10.5281/zenodo.10627372.

P values for two-group comparisons were calculated using the Wilcoxon rank sum test. Correlation of continuous variables was assessed using Spearman’s rank correlation coefficient. Receiver operating characteristic (ROC) curves were compared using DeLong’s test. The 95% CIs for area under the ROC curve were based on DeLong’s method.

Acknowledgments

We thank members of our laboratories for critical review of the manuscript. The results shown here are in part based on data generated by the TCGA Research Network (http://cancer.gov/tcga), the ENCODE Consortium (http://encodeproject.org), and the T2T Consortium (https://sites.google.com/ucsc.edu/t2tworkinggroup/).

Funding: This work was supported in part by the Dr. Miriam and Sheldon G. Adelson Medical Research Foundation (to V.E.V., J.P., and R.B.S.); SU2C in-Time Lung Cancer Interception Dream Team Grant (to V.E.V. and J.P.); Stand Up to Cancer-Dutch Cancer Society International Translational Cancer Research Dream Team Grant (SU2C-AACR-DT1415) (to V.E.V.); the Gray Foundation (to V.E.V. and J.P.); the Honorable Tina Brozman Foundation (to V.E.V. and J.P.); the Commonwealth Foundation (to V.E.V., V.A., and R.B.S.); the Mark Foundation for Cancer Research (to D.M.); the Cole Foundation (to V.E.V.); a research grant from Delfi Diagnostics (to V.E.V. and R.B.S.); and US National Institutes of Health grants CA121113 (to V.E.V.), CA006973 (to V.E.V.), CA233259 (to V.E.V.), CA062924 (to V.E.V. and R.B.S.), CA271896 (to V.E.V.) and 1T32GM136577 (to A.V.A.). Stand Up To Cancer is a program of the Entertainment Industry Foundation administered by the American Association for Cancer Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author contributions: A.V.A., R.B.S., and V.E.V. conceptualized the study, designed the methodology, led the formal analysis and data visualization, and wrote the paper. N.N., J.R.W., D.C.B., C.C., J.E.M., V.A., C.H., D.M., Z.H.F., and J.P. contributed to investigation and/or data visualization. J.P., R.B.S., and V.E.V. obtained funding and provided supervision. V.A., J.P., R.B.S., and V.E.V. provided resources and led the project administration. A.V.A., N.N., J.R.W., D.C.B., C.C., R.B.S., and V.E.V. curated the genomic data used in the study. A.V.A. and R.B.S. wrote software and conducted statistical analyses. A.V.A., J.R.W., C.C., and R.B.S. validated the software. All authors reviewed, edited, and approved of the manuscript before submission.

Competing interests: A.V.A., R.B.S., and V.E.V. are inventors on patent applications submitted by Johns Hopkins University related to genome-wide repeat landscapes in cancer and cfDNA (US Patent application number 63/532,642). A.V.A., D.C.B., V.A., D.M., Z.H.F., J.P., and R.B.S. are inventors on patent applications submitted by Johns Hopkins University related to cell-free DNA for cancer detection that have been licensed to Delfi Diagnostics. J.R.W. is the founder and owner of Resphera Biosciences LLC and serves as a consultant to Personal Genome Diagnostics Inc. and Delfi Diagnostics Inc. C.C. is the founder and owner of CMCC Consulting. J.P., V.A., and R.B.S. are founders of Delfi Diagnostics, and V.A. and R.B.S. are consultants for this organization. V.E.V. is a founder of Delfi Diagnostics, serves on the board of directors and as an officer for this organization, and owns Delfi Diagnostics stock, which is subject to certain restrictions under university policy. In addition, Johns Hopkins University owns equity in Delfi Diagnostics. V.E.V. divested his equity in Personal Genome Diagnostics (PGDx) to LabCorp in February 2022. V.E.V. is an inventor on patent applications submitted by Johns Hopkins University related to cancer genomic analyses and cell-free DNA for cancer detection that have been licensed to one or more entities, including Delfi Diagnostics, LabCorp, QIAGEN, Sysmex, Agios, Genzyme, Esoterix, Ventana, and ManaT Bio. Under the terms of these license agreements, the university and inventors are entitled to fees and royalty distributions. V.E.V. is an advisor to Viron Therapeutics and Epitope. These arrangements have been reviewed and approved by the Johns Hopkins University in accordance with its conflict-of-interest policies. The remaining authors declare that they have no competing interests….
