Molecular Medicine Israel

Detecting Somatic Mutations in Normal Cells

Highlights
Somatic mosaicism resulting from post-zygotic mutations has been shown to contribute to many diseases including brain-related disorders, in addition to cancer. Emerging data also suggest that mosaicism is common in healthy individuals.

Mutations occurring late in development have very low allele fractions, and their detection requires specialized algorithms and filters that can remove artifacts that arise in sample handling, DNA sequencing, and analysis.

Emerging technologies, such as single-cell sequencing and linked-read sequencing, allow improved phasing of variants, thus increasing detection accuracy.

Somatic mutations have been studied extensively in the context of cancer. Recent studies have demonstrated that high-throughput sequencing data can be used to detect somatic mutations in non-tumor cells. Analysis of such mutations allows us to better understand the mutational processes in normal cells, explore cell lineages in development, and examine potential associations with age-related disease. We describe here approaches for characterizing somatic mutations in normal and non-tumor disease tissues. We discuss several experimental designs and common pitfalls in somatic mutation detection, as well as more recent developments such as phasing and linked-read technology. With the dramatically increasing numbers of samples undergoing genome sequencing, bioinformatic analysis will enable the characterization of somatic mutations and their impact on non-cancer tissues.

Somatic Mosaicism and Challenges in Detecting Mosaic Variants
Genomes from individuals of the same species differ from one another because of a constant influx of genetic mutation and recombination. Single-nucleotide variants (SNVs; see Glossary), copy-number variants (CNVs), transposable element (TE) insertions, and other structural variants (SVs) are common types of genetic variation. Population-level heterogeneity generally arises due to germline mutations that occur before the formation of the zygote, and are inherited by all cells in the offspring. However, heterogeneity within an individual may also exist due to somatic mutations that occur post-zygotically and exist only in a subpopulation of cells. The genetic heterogeneity resulting from somatic mutations is known as somatic mosaicism. Recent papers have attempted to characterize somatic mosaicism [1], but the extent to which it exists, whether specific regions of the genome and nucleotide contexts are more susceptible to it, and how it impacts on normal cellular function remain open questions.

In bulk sequencing data, somatic mutations have variant allele fractions (VAFs) that deviate from those typical of germline mutations (∼0.5 for heterozygous and ∼1 for homozygous variants). The VAF of a somatic mutation depends both on the prevalence of the mutation, which is largely driven by how early the mutation occurs in development, and on the heterogeneity of the tissue selected for sequencing. For example, if a mutation occurs during the first cell division, and every cell produces the same number of descendants, the VAF would be ∼0.25 in an unbiased sample (Figure 1, Key Figure). At the other extreme, if a mutation is uniquely acquired in a post-mitotic cell, the VAF would be infinitesimal (in bulk sequencing of 1 million cells, the VAF would be ∼0.5 × 10⁻⁶). In general, somatic mutations occurring earlier during development attain higher VAFs than those occurring later. However, asymmetry in the developmental cell-lineage tree [2], heterogeneity in selective pressure across tissues [3], and technical factors (such as low read depth, sequencing errors, and misalignment) can violate this principle.
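The two VAF figures quoted above follow from simple allele counting. The sketch below is only an illustration under idealized assumptions that are ours, not a method from the review: a diploid locus, one mutant allele per carrier cell, and an unbiased sample of cells. The function name `expected_vaf` is hypothetical.

```python
def expected_vaf(mutant_cells: int, total_cells: int,
                 mutant_alleles_per_cell: int = 1, ploidy: int = 2) -> float:
    """Idealized bulk VAF: mutant alleles divided by all alleles sampled.

    Assumes an unbiased sample and ignores sequencing error, copy-number
    changes, and selection, all of which can shift the observed VAF.
    """
    if total_cells <= 0 or not 0 <= mutant_cells <= total_cells:
        raise ValueError("need 0 <= mutant_cells <= total_cells and total_cells > 0")
    return (mutant_cells * mutant_alleles_per_cell) / (total_cells * ploidy)


# Mutation arising at the first cell division: half of all cells carry it.
first_division = expected_vaf(mutant_cells=1, total_cells=2)

# Mutation private to a single post-mitotic cell among one million cells.
private_mutation = expected_vaf(mutant_cells=1, total_cells=1_000_000)
```

Under these assumptions `first_division` is 0.25 and `private_mutation` is 0.5 × 10⁻⁶, matching the values in the text.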

A great deal of work has been done to develop algorithms for detecting somatic mutations in cancer. However, the VAFs of functionally relevant cancer mutations tend to be higher than those in normal cells because of the selective advantage conferred by those mutations in proliferating cells. Thus, many popular algorithms for cancer are not focused on detecting very low VAF events (e.g., <5% [4]), and comprehensive detection of somatic mutations at arbitrarily small VAFs in normal cells requires alternative methods. In addition, somatic mutations in cancer are typically identified by the tumor–normal design in which tumor tissue is compared to non-cancerous (‘normal’) tissue from the same individual to determine the mutations unique to the tumor. For non-cancer samples, mosaic variants arising early in embryogenesis are often shared among many tissues. This makes it difficult to identify a clear normal cellular subpopulation that can serve as a matched control. With careful selection of tissue specimens, however, it is possible to derive an accurate list of mosaic mutations that allows lineage analysis of cells in an individual. For example, Lodato et al. [5] analyzed heart and brain tissues, which develop from the mesoderm and ectoderm, respectively, to find mosaic mutations informative of brain cell lineage; Behjati et al. [6] compared endoderm-derived gastrointestinal tissues to mouse tail, which consists of both mesodermal and ectodermal tissues, to find early embryonic mutations. The locations of selected specimens within a larger tissue can also be relevant: Martincorena et al. [7] utilized ultra-deep sequencing of multiple nearby fine biopsies to infer spatial patterns and rates of mosaicism in human skin.

In this review we provide an overview of somatic mutation analysis in normal cells. We first cover the various platforms and experimental designs including bulk sequencing and single-cell sequencing. We then describe strategies for detecting variants such as phasing of haplotypes, as well as common pitfalls encountered in these analyses.

Strategies for Profiling Mosaic Variants
Whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted panels offer tradeoffs between the types of detectable variants and the range of detectable VAFs. WGS produces the most uniform read depth across the genome and enables the detection of most types of somatic mutations, including structural variants. However, detection is limited to relatively high VAF mutations because the high sequencing depth required to detect low VAF mutations remains prohibitively expensive [8]. If attention can be restricted to specific loci, a customized panel can be constructed (e.g., amplicon-seq or targeted hybridization methods) and sequenced at very high depth (e.g., >100 000×). WES offers a compromise between WGS and small panels by targeting the ∼1–2% of the genome that codes for proteins and does not need to be custom-designed.
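To make the depth-versus-VAF tradeoff concrete, the following sketch estimates how likely a variant is to be supported by a minimum number of reads at a given depth. It is a simplified binomial sampling model of our own, not a procedure from the review: it assumes each read independently carries the mutant allele with probability equal to the VAF and ignores sequencing error, mapping bias, and uneven coverage. The names `detection_probability` and `min_depth`, and the defaults of 3 supporting reads and a 95% target, are illustrative assumptions.

```python
from math import comb


def detection_probability(depth: int, vaf: float, min_alt_reads: int = 3) -> float:
    """P(at least `min_alt_reads` mutant reads) under Binomial(depth, vaf)."""
    if depth < 0 or min_alt_reads < 1 or not 0.0 <= vaf <= 1.0:
        raise ValueError("invalid depth, VAF, or read threshold")
    # Sum the probability of seeing fewer than the required number of reads.
    p_too_few = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_too_few


def min_depth(vaf: float, min_alt_reads: int = 3, target: float = 0.95,
              max_depth: int = 10_000_000) -> int:
    """Smallest depth whose detection probability reaches `target`."""
    if not 0.0 < vaf <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("vaf must be in (0, 1] and target in (0, 1)")
    lo = hi = min_alt_reads
    # Double the upper bound until it is deep enough.
    while detection_probability(hi, vaf, min_alt_reads) < target:
        lo, hi = hi, hi * 2
        if hi > max_depth:
            raise ValueError("target not reachable below max_depth")
    # Binary search; detection probability never decreases with depth.
    while lo < hi:
        mid = (lo + hi) // 2
        if detection_probability(mid, vaf, min_alt_reads) >= target:
            hi = mid
        else:
            lo = mid + 1
    return hi
```

Calling `min_depth(0.25)` and `min_depth(0.01)` side by side shows how quickly the required depth grows as the VAF falls, which is the cost argument made above for WGS versus targeted panels.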

Characterizing Variants at the Single-Cell Level
Unlike bulk sequencing strategies that pool DNA from thousands or millions of cells, single-cell sequencing attempts to sequence the DNA of only one cell. The advantage is that rare mosaic mutations can be more easily detected: if present in a diploid region of the chosen cell, the mutation will be present on one of two alleles, regardless of its frequency in the surrounding tissue (Figure 1). This shifts the technical difficulties associated with low frequency away from variant detection and onto the cell selection process. To estimate the overall frequency of each mutation in the tissue, multiple single cells must be sequenced, which can be expensive, laborious, and confounded by sampling bias. Hybrid experimental designs integrating bulk (either WGS or targeted) and single-cell approaches can address many of these issues. For example, somatic mutations discovered in bulk can be confirmed by single-cell data, and frequencies for somatic mutations discovered in single cells can be estimated from bulk sequencing.
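The cost of sampling multiple single cells can be illustrated with a small calculation. The sketch below asks how many cells must be sequenced to capture at least one carrier of a mutation. It assumes, as a simplification of ours, that cells are sampled independently and without bias, which is exactly the assumption the text warns may fail in practice. The function names are hypothetical.

```python
from math import ceil, log1p


def prob_at_least_one_carrier(cell_fraction: float, n_cells: int) -> float:
    """P(at least one of n sampled cells carries the mutation)."""
    if not 0.0 <= cell_fraction <= 1.0 or n_cells < 0:
        raise ValueError("cell_fraction must be in [0, 1] and n_cells >= 0")
    return 1.0 - (1.0 - cell_fraction) ** n_cells


def cells_needed(cell_fraction: float, target: float = 0.95) -> int:
    """Smallest n with P(at least one carrier) >= target.

    `cell_fraction` is the fraction of cells carrying the mutation; for a
    heterozygous mutation in a diploid region this is roughly 2 x bulk VAF.
    """
    if not 0.0 < cell_fraction <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("cell_fraction must be in (0, 1] and target in (0, 1)")
    if cell_fraction == 1.0:
        return 1
    # Solve 1 - (1 - c)^n >= target for n.
    return ceil(log1p(-target) / log1p(-cell_fraction))
```

For example, a mutation with a bulk VAF of 0.005 corresponds to a cell fraction of about 0.01 under the heterozygous diploid assumption, so `cells_needed(0.01)` gives the number of cells for a 95% chance of sampling a carrier in this idealized setting.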

A common strategy to produce sufficient input DNA for next-generation sequencing from a single cell is clonal expansion, in which a cell is expanded in culture until there are sufficient cells to perform standard bulk sequencing [6, 9, 10, 11, 12, 13]. However, additional mutations, especially SNVs, are continuously acquired during expansion, and these must be differentiated from mutations that existed in the founding cell. This is often addressed by discarding low VAF candidate mutations because in vitro mutations acquired after the first mitosis should be present at <25% VAF if cell division in culture is approximately symmetric. However, this symmetry assumption could be violated by variability in cell cycle length and the potential for selectively advantageous mutations in vitro, and careful analysis is therefore warranted. It has also been shown that in vitro mutations can be characterized by mutational signatures that correlate with increasing culture time [6, 10]. An additional concern is that only a subset of the isolated single cells may successfully expand into colonies, possibly reflecting differences in cell fitness, tolerance to handling and cell culture, or stochastic effects. Thus, studies relying exclusively on clonal expansion might not provide an accurate picture of tissue heterogeneity owing to biased loss of specific cell types. For post-mitotic cell types (e.g., neurons), clonal expansion is not directly applicable. Encouragingly, a recent study demonstrated that adult neurons in mice could be clonally expanded and sequenced after inducing totipotency via somatic cell nuclear transfer (SCNT) [14]. However, SCNT is labor-intensive, notoriously inefficient, and may be even further affected by selection biases.

Another widely used approach to produce enough DNA from a single cell is to apply whole-genome amplification (WGA) [15, 16, 17] followed directly by sequencing. This approach has been used both in cancer [18, 19, 20] and in development [5, 21, 22, 23, 24]. Several WGA methods are available, each representing a different tradeoff between genomic coverage, amplification uniformity, and artifact load; they are reviewed elsewhere [15]. Because cell culture is unnecessary, WGA-based methodologies enjoy significant cost savings in both labor and reagents, and can be directly applied to post-mitotic cells (such as neurons) and cells that are difficult to culture. The technical simplicity of WGA has also made it an attractive technology for scaling to handle hundreds or thousands of cells simultaneously [25, 26]. However, the disadvantage of WGA is the introduction of considerable amplification bias and allelic imbalance/allele dropout, which can produce artifacts that can be difficult to distinguish from true mutations. Research to improve variant calling despite these amplification artifacts is ongoing. It was recently demonstrated [27] that good specificity can be achieved for SNV detection for candidate somatic mutations that can be linked to nearby germline heterozygous variants (∼20% of the candidates, if using standard Illumina sequencing; discussed further below)...
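Returning to clonally expanded samples, the VAF-based filter described earlier can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the `Candidate` record, the `split_by_vaf` helper, and the hard 25% cutoff are assumptions made for the example. A real analysis would need to account for sampling noise around the cutoff and for violations of the symmetric-division assumption noted in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate mutation in an expanded clone."""
    name: str
    alt_reads: int
    depth: int

    def __post_init__(self) -> None:
        if self.depth <= 0 or not 0 <= self.alt_reads <= self.depth:
            raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")

    @property
    def vaf(self) -> float:
        return self.alt_reads / self.depth


def split_by_vaf(candidates: List[Candidate],
                 threshold: float = 0.25) -> Tuple[List[Candidate], List[Candidate]]:
    """Separate candidates at or above the VAF threshold from those below it.

    Candidates below the threshold are treated as likely in vitro mutations,
    following the symmetric-division reasoning; the cutoff is illustrative.
    """
    kept = [c for c in candidates if c.vaf >= threshold]
    discarded = [c for c in candidates if c.vaf < threshold]
    return kept, discarded


# Made-up read counts, for illustration only.
example = [
    Candidate("founder_like", alt_reads=48, depth=100),
    Candidate("late_in_vitro", alt_reads=6, depth=100),
]
kept, discarded = split_by_vaf(example)
```

Here `kept` contains only the candidate near the ∼50% VAF expected for a heterozygous mutation present in the founding cell, while the 6% candidate is set aside as a probable culture artifact.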
