Molecular Medicine Israel

Exome sequencing identifies breast cancer susceptibility genes and defines the contribution of coding variants to breast cancer risk

Abstract

Linkage and candidate gene studies have identified several breast cancer susceptibility genes, but the overall contribution of coding variation to breast cancer is unclear. To evaluate the role of rare coding variants more comprehensively, we performed a meta-analysis across three large whole-exome sequencing datasets, containing 26,368 female cases and 217,673 female controls. Burden tests were performed for protein-truncating and rare missense variants in 15,616 and 18,601 genes, respectively. Associations between protein-truncating variants and breast cancer were identified for the following six genes at exome-wide significance (P< 2.5 × 10−6): the five known susceptibility genes ATMBRCA1BRCA2CHEK2 and PALB2, together with MAP3K1. Associations were also observed for LZTR1ATR and BARD1 with P < 1 × 10−4. Associations between predicted deleterious rare missense or protein-truncating variants and breast cancer were additionally identified for CDKN2A at exome-wide significance. The overall contribution of coding variants in genes beyond the previously known genes is estimated to be small.

Main

Breast cancer is the leading cause of cancer-related mortality for women worldwide. Genetic susceptibility to breast cancer is known to be conferred by common variants, identified through genome-wide association studies (GWAS), together with rarer coding variants conferring higher disease risks. The latter, identified through genetic linkage or targeted sequencing studies, includes protein-truncating variants (PTVs) and some rare missense variants in ATM, BARD1BRCA1BRCA2CHEK2RAD51CRAD51DPALB2 and TP53 (ref. 1). However, these variants together explain less than half the familial relative risk (FRR) of breast cancer2. The contribution of rare coding variants in other genes remains largely unknown.

Here we used data from the following three large whole-exome sequencing (WES) studies, primarily of European ancestry, to assess the role of rare variants in all coding genes: the Breast Cancer Risk after Diagnostic Gene Sequencing (BRIDGES) dataset that included samples from eight studies in the Breast Cancer Association Consortium (BCAC), the PERSPECTIVE (Personalized Risk assessment for prevention and early detection of breast cancer: integration and implementation) dataset that included three BCAC studies (Supplementary Table 1) and UK Biobank (UKB). After quality control (QC; Methods), these datasets comprised 26,368 female cases and 217,673 female controls (Supplementary Table 2).

We considered the following two main categories of variants: PTVs and rare missense variants (minor allele frequency <0.001). Single-variant association tests are generally underpowered for rare variants; however, burden tests, in which variants are collapsed together, can be more powerful if the associated variants have similar effect sizes3. To further improve power, we incorporated data on family history of breast cancer (Methods)4. Association tests were conducted for all genes in which there was at least one carrier of a variant (15,616 genes for PTVs and 18,601 genes for rare missense variants).

In the PTV meta-analysis, 30 genes were associated at P < 0.001 (Supplementary Table 3 and Figs. 1 and 2). Of these, six met exome-wide significance (P < 2.5 × 10−6), of which five are known breast cancer risk genes— ATM, BRCA1BRCA2CHEK2 and PALB2. Associations were also identified for PTVs in MAP3K1 (P = 1.2 × 10−9). Associations at P < 1 × 10−4 were also identified for PTVs in LZTR1, ATR interacting protein (ATRIP) and the known risk gene BARD1. Of the other previously identified breast cancer susceptibility genes, associations with P < 0.01 were observed for CDH1 and RAD51D (Supplementary Table 4). Associations significant at P < 0.01 were not observed for other known susceptibility genes, but PTV frequencies were very low and the confidence limits include the previous odds ratio (OR) estimates1,5.

There was no evidence for an excess of associations significant at P < 0.001 after allowing for the six exome-wide significant genes (Fig. 2). However, 28 of the 30 associations at P < 0.001 correspond to an increased risk, compared with ~15 that would be expected by chance. This imbalance suggests some of the other associations may be genuine.

We performed additional analyses by age and (within the BCAC dataset) the following disease subtypes: estrogen receptor (ER)+ or ER, progesterone receptor (PR)+ or PR and triple-negative disease. When restricting the age of cases to <50 years, 40 genes were associated (all with increased risk) at P < 0.001, suggesting an enrichment of associations in this age group (Supplementary Table 5), MGAT5 met exome-wide significance, in addition to BRCA2, BRCA1, CHEK2, PALB2, ATM and MAP3K1. In the subtype-specific analyses (Supplementary Table 6), the expected associations for known genes were observed, importantly, the higher OR for ER and triple-negative disease for BRCA1 and higher OR for ER+ disease for CHEK2, but no other genes were associated with subtype-specific disease at exome-wide significance.

For the rare missense variant meta-analysis, 28 genes had a P < 0.001, 18 of which corresponded to an increased risk of breast cancer (Supplementary Table 7 and Extended Data Figs. 1 and 2) compared to 14 expected by chance. Only CHEK2 met exome-wide significance (P = 7.0 × 10−19). Associations with P < 1 × 10−4 were also observed for rare missense variants in SAMHD1HCN2, CLIC6 and ACTL8.

We next considered missense variants predicted deleterious combined with PTVs. We defined deleterious missense variants using Combined Annotation Dependent Depletion (CADD score > 20) (ref. 6) and Helix (Helix score > 0.5) (ref. 7). When using CADD, 33 genes had a P < 0.001, 22 of which corresponded to an increased risk of breast cancer (Supplementary Table 8 and Extended Data Figs. 3 and 4). Six genes met exome-wide significance, including the following known five risk genes: CHEK2 (P = 2.8 × 10−66), BRCA2 (P = 7.2 × 10−44), PALB2 (P = 4.5 × 10−26), ATM (P = 3.3 × 10−21) and BRCA1 (P = 1.6 × 10−17), together with CDKN2A (P = 8.3 × 10−7). Associations with P < 1 × 10−4 were also observed for SAMHD1MRPL27EXOC4 and PPP1R3B. When instead defining deleterious rare missense variants using Helix and combining with PTVs, 29 genes had a P < 0.001, 25 of which corresponded to an increased risk of breast cancer (Supplementary Table 9). Only the known five genes met exome-wide significance. Associations with P < 1 × 10−4 were also observed for LZTR1, MAP3K1, DCLK1, MDM4, STX3 and ATRIP.

Notably, of the genes with P < 1 × 10−4MAP3K1, LZTR1, ATRIP, CDKN2A and SAMHD1 have prior evidence of being tumor suppressor genes (TSGs). MAP3K1 is a stress-induced serine/threonine kinase that activates the extracellular signal-regulated kinase (ERK) and Jun N-terminal kinase (JNK) pathways by phosphorylation of MAP2K1 and MAP2K4 (refs. 8,9). Inactivating variants in MAP3K1 are one of the commonest somatic driver events in breast tumors10,11MAP3K1 is also a well-established breast cancer GWAS locus12; at least three independent signals have been identified mapping to regulatory regions with MAP3K1 expression as the likely target13,14. To evaluate whether the MAP3K1 PTV association we observed was driven by the GWAS associations, or vice-versa, we fitted logistic regression models to UKB data in which the PTV burden variable and the lead GWAS SNPs (SNP1: rs62355902, SNP2: rs984113 and SNP3: rs112497245) were considered jointly (Supplementary Table 10). In the model with all variables, the OR associated with carrying a PTV (OR = 4.95 (2.27, 10.82)) was similar to the unadjusted OR. Similarly, the ORs for each of the SNPs were similar to the ORs without adjustment for PTVs. This suggests that the PTV burden and GWAS associations are independent and reflect the distinct effects of inactivating coding alterations and regulatory variants.

ATRIP codes for a DNA damage response protein that forms a complex with ATR. ATR–ATRIP is involved in the process that activates checkpoint signaling when single-stranded DNA is detected following the processing of DNA double-stranded breaks or stalled replication forks15,16LZTR1 codes for a protein found in the Golgi apparatus17. Germline mutations in LZTR1 have been associated with schwannomatosis, a rare tumor predisposition syndrome18,19CDKN2A also codes for tumor suppressor proteins, including p16(INK4A) and p14(ARF) (ref. 20). CDKN2A is a known melanoma21 and pancreatic cancer susceptibility gene and is an important TSG altered in a wide variety of tumors, including breast cancer22. There have been some previous suggestions that deleterious germline CDKN2A is associated with breast cancer risk23,24SAMHD1 promotes the degradation of nascent DNA at stalled replication forks, limiting the release of single-stranded DNA25SAMHD1 also encodes dNTPase that protects cells from viral infections26 and is frequently mutated in multiple tumor types, including breast cancer. Furthermore, damaging germline variants in SAMHD1 have recently been associated with delayed age at natural menopause and increased all-cause cancer risk27MDM4 encodes a p53 repressor overexpressed in a variety of tumors28 and is also a GWAS locus for triple-negative breast cancer13,29.

Pathology information was available for cases in the BCAC dataset. We tabulated pathology characteristics for carriers of variants in genes with P < 1 × 10−4 in the meta-analysis of PTVs or the meta-analysis of predicted deleterious (CADD) rare missense variants combined with PTVs (Supplementary Tables 11 and 12). These data suggest a slightly higher proportion of mixed lobular and ductal tumors for LZTR1 PTV carriers and MRPL27 deleterious rare missense variant or PTV carriers. There was a slightly higher proportion diagnosed >50 years for ATRIP PTV carriers and a higher proportion of HER2+ tumors for EXOC4 deleterious rare missense variant or PTV carriers. However, the number of carriers is small in each case.

We performed Gene Set Enrichment Analysis (GSEA) based on the PTV associations for pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG), BioCarta and Reactome. We did this for all genes including and excluding BRCA1BRCA2ATMCHEK2 and PALB2. When including the five genes, 28 pathways had a false discovery rate (q) < 0.05 (Supplementary Table 13). Of these, all but one (Reactome peptide hormone biosynthesis) include BRCA1 or BRCA2. The top pathway was Reactome DNA double-strand break repair. After excluding the five genes (Supplementary Table 14), the strongest enrichment was for the BioCarta NFKB and CD40 pathways (which contain MAP3K1), Reactome DNA double-strand break response and Reactome hormone peptide biosynthesis (all q < 0.10).

To evaluate the overall contribution of PTVs to the FRR, we fitted models to the effect size using an empirical Bayes approach. We used whole-genome sequencing data in UKB to estimate the missing contribution due to large rearrangements. Under the assumption of an exponentially distributed effect size, the estimated proportion of risk genes (a) was 0.0047 with a median OR of 1.38. Under this model, an estimated 10.61% of the FRR would be explained by all PTVs, of which 9.64% would be due to the five genes BRCA1BRCA2ATMCHEK2 and PALB2 and 0.97% due to all other genes combined with MAP3K1 contributing 0.14% (Supplementary Table 15). Only the six genes reaching exome-wide significance for PTVs had a posterior probability of association >0.90. We repeated these analyses using the subsets of genes including breast cancer driver genes and target genes of GWAS signals identified in ref. 13, the list of cancer predisposition genes identified in ref. 30, Catalogue Of Somatic Mutations In Cancer (COSMIC) TSGs31 and the top pathways identified by GSEA (Supplementary Table 16). The largest contributions to the FRR, after excluding the five known genes, were for the BioCarta CD40 pathway (0.657%, n = 16, a = 0.628) and COSMIC TSGs (0.639%, n = 320, a = 0.196). Thus, these results suggest that the majority of the remaining risk genes are TSGs.

These results demonstrate that large exome sequencing studies, combined with efficient burden analyses, can identify additional breast cancer susceptibility genes. The excess of positive associations at P < 0.001 indicates that further genes should be identifiable through large datasets—the heritability analyses suggest the number of associated genes might be of the order of 90, with the majority of these being TSGs. Although the estimated risks associated with the new genes, in particular MAP3K1 PTVs, would be large enough to be of clinical relevance, the effect sizes might be over-estimated due to the ‘winner’s curse’32. Thus, further replication in larger datasets will also be necessary to provide more precise estimates for variants in the new genes, to define the set of variants in these genes associated with breast cancer, the clinic-pathological characteristics of tumors in variant carriers and the combined effects of pathogenic variants and other risk factors. The heritability analyses suggest that most of the contribution of PTVs is mediated through the five genes BRCA1BRCA2ATMCHEK2 and PALB2, commonly tested for in clinical cancer genetics33. These analyses assume dominant inheritance and recessive genes may also contribute to the familial risk, while subsets of missense variants may also make important contributions (exemplified by CDKN2A and SAMHD1). However, these results suggest that the majority of the ‘missing’ heritability is likely to be found in the noncoding genome.

Methods

UKB

The UKB is a population-based prospective cohort study of more than 500,000 subjects. More detailed information on the UKB is given elsewhere34,35. The study received ethics approval from the North West Multi-center Research Ethics Committee. All participants signed written informed consent before participating. WES data for 450,000 subjects were released in October 2021 and accessed via the UKB DNA Nexus platform36. QC metrics were applied to Variant Call Format (VCF) files as discussed in ref. 37, including genotype level filters for depth and genotype quality.

Samples with disagreement between genetically determined and self-reported sex, sex aneuploidy or excess relatives in the dataset were excluded. Excess relatives were identified by considering pairs of individuals with kinship >0.17. If one individual in a pair was a case and one was a control then the case was preferentially selected; otherwise, one individual was selected at random. Genetic ancestry was estimated using genetic principal components and the Gilbert–Johnson–Keerthi distance algorithm38. If genetic principal components were not available, self-reported ancestry was used. Samples of ancestry other than European were excluded. The final dataset for analysis included 419,307 samples with 227,393 females. Cases were defined by having invasive breast cancer (International Classification of Diseases (ICD)-10 code C50) or carcinoma in situ (D05), as determined by linkage to the National Cancer Registration and Analysis Service (NCRAS), or self-reported breast cancer. Both prevalent and incident cases were included. Only breast cancers that were an individual’s first or second diagnosed cancer were included as cases. By this definition, 17,958 female and 94 male cases were included.

For structural variants, we accessed the structural variant population VCF files for the initial release of UKB whole-genome sequencing via the DNA Nexus platform. These deletions, duplications and insertions were called using GraphTyper (2.7.1) (refs.39,40). We identified any structural variant that passed the GraphTyper QC filters and overlapped an exon of the MANE transcripts of each gene. The samples were filtered using the above exclusions leaving 64,264 samples (4,847 female breast cancer cases and 59,417 female controls). The frequency of structural variants was then calculated for each gene and used to adjust the PTV frequency (Supplementary Methods).

The BCAC datasets

The BRIDGES and PERSPECTIVE samples were from studies in the BCAC (BRIDGES: eight studies, PERSPECTIVE: three studies; Supplementary Table 1). All studies were approved by ethical review boards (Supplementary Table 17). All subjects provided written informed consent. Most samples were previously included in a targeted panel sequencing project1. Phenotype data were based on the BCAC database v14. Samples were oversampled for early-onset (age of diagnosis below 50 years) or family history of breast cancer. Cases were preferentially selected to have information on tumor pathology. Samples with previously identified pathogenic mutations in BRCA1BRCA2 or PALB2 (348 cases, 176 controls) were not included….

Sign up for our Newsletter