Molecular Medicine Israel

Global detection of human variants and isoforms by deep proteome sequencing

Abstract

An average shotgun proteomics experiment detects approximately 10,000 human proteins from a single sample. However, individual proteins are typically identified by peptide sequences representing a small fraction of their total amino acids. Hence, an average shotgun experiment fails to distinguish different protein variants and isoforms. Deeper proteome sequencing is therefore required for the global discovery of protein isoforms. Using six different human cell lines, six proteases, deep fractionation and three tandem mass spectrometry fragmentation methods, we identify a million unique peptides from 17,717 protein groups, with a median sequence coverage of approximately 80%. Direct comparison with RNA expression data provides evidence for the translation of most nonsynonymous variants. We have also hypothesized that undetected variants likely arise from mutation-induced protein instability. We further observe comparable detection rates for exon–exon junction peptides representing constitutive and alternative splicing events. Our dataset represents a resource for proteoform discovery and provides direct evidence that most frame-preserving alternatively spliced isoforms are translated.

Main

Near-complete proteomes of simple organisms can be detected by mass spectrometry (MS) following only 1 h of analysis1,2. For more complex organisms, it is possible to monitor over 10,000 proteins within a day (refs. 3,4,5,6,7). Community-based maps of the human proteome, assembled using extensive data from various tissues and cell types from laboratories across the world, have provided evidence for the translation of >90% of annotated protein-coding genes7,8. However, although the human genome contains approximately 20,000 protein-coding genes9,10, it is estimated that alternative splicing events, whereby precursor messenger RNA sequences are combined in different arrangements, have the potential to notably increase proteome diversity. Specifically, from RNA sequencing (RNA-seq) analysis of human organs, reports have estimated that transcripts from more than 95% of multi-exon genes undergo alternative splicing11,12. Furthermore, recent single-cell transcriptome sequencing has revealed that true splice isoform complexity is likely greater than previously appreciated13,14. Other sources of proteome variation, such as single-amino acid polymorphisms (SAPs), alternative splicing and posttranslational modifications, further increase proteomic complexity15,16,17,18,19,20.

Limitations in proteomic technology have not permitted the global-scale detection of protein diversity. Typically, for shotgun proteomic methods, the presence of an entire protein is determined using a small number of peptide proxies—as few as two or three. Thus, sequence coverage in a proteomics experiment is generally insufficient to fully characterize all protein states present within a sample21,22. Yet the ability to precisely monitor protein isoforms is essential to understanding biological systems. Even the current deepest proteomic datasets23,24 do not contain enough sequence data to globally identify proteoforms. One approach to achieving proteoform-level detection is top-down MS, a strategy that measures intact protein mass before dissociation for sequence determination using tandem mass spectrometry (MS/MS). Ensuring no loss in resolution, the top-down strategy is appealing. Practical issues with high-mass proteins, sequence coverage and detection of low-abundance species, however, limit its impact25.

Given the technical hurdles with top-down proteomics, we revisited the shotgun strategy. Shotgun proteomics preferentially relies on trypsin to catalyze hydrolysis of proteins. Trypsin cleaves C-terminal to lysine and arginine residues and produces peptides of length and charge distributions most amenable to MS/MS. However, even with the assistance of extensive chromatographic separation, not all portions of the proteome are accessible from tryptic peptides26,27; many of the peptides produced are either too short or too long to be detected using current liquid chromatography–mass spectrometry (LC–MS) technology. As proteoforms can differ by a small number of amino acids, extensive sequence coverage is crucial for distinguishing near-identical variants. The use of alternative enzymes in addition to trypsin during digestion can increase the amino acid coverage of individual proteins, phosphorylation sites and whole proteomes28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44. However, given the considerably increased effort involved, this strategy is not amenable to routine use and to our knowledge has not been previously employed for the global-scale detection of proteoforms.

In this study, we investigate whether the separate digestion of human proteomes expressed in six different cell lines with six different proteases, coupled with extensive liquid chromatography (LC) fractionation and state-of-the-art MS, produces sufficient sequence depth to afford a global assessment of how genomic variants and alternative splicing are incorporated into the proteome. Generated peptides were extensively fractionated before analysis on an Orbitrap Tribrid mass spectrometer, where they were dissociated using various fragmentation methods, including higher-energy collisional dissociation (HCD)45, collisionally activated dissociation (CAD)46 and electron transfer dissociation (ETD)47,48. We collected ~20 million high-resolution mass spectra and ~164 million MS/MS spectra from ~2,500 nano-scale liquid chromatography-tandem mass spectrometry (nLC–MS/MS) experiments. The combined data enabled identification of 17,717 unique proteins with an overall median sequence coverage of 79.2%. Using these data, we provide a global view of genomic and transcriptomic sequence variant expression at the protein level. From a direct comparison with quantitative RNA-seq data, we detect ~80% of SAPs and ~20% of exon–exon junctions, representing both inclusion and skipping of frame-preserving alternative splicing events. However, for proteins with the highest proteomics sequence coverage, represented by genes with relatively high expression (that is, log2 of reads per kilobase per million (RPKM) of ≥7) at the transcript level, ~64% of frame-preserving alternative splicing events are detected and the rates of detection of constitutively spliced and alternatively spliced junctions are similar. Finally, using the extensive, overlapping peptide sequence information provided by this resource, we demonstrate the feasibility of de novo protein assembly.
Data generated from the present study represent the deepest proteomics map collected to date and have been compiled into an online resource at deep-sequencing.app. These methods and resources lay the foundation for comprehensive mapping of protein diversity and are expected to catalyze future research efforts.
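
For readers unfamiliar with the expression threshold quoted above, a log2 RPKM of ≥7 corresponds to an RPKM of at least 128. The following is a minimal sketch of that calculation, assuming the standard definition of RPKM (reads per kilobase of transcript per million mapped reads). The function names and the example counts are hypothetical and are not taken from the study's pipeline.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def is_highly_expressed(read_count, gene_length_bp, total_mapped_reads,
                        log2_cutoff=7.0):
    """Apply a log2(RPKM) cutoff; the default of 7 mirrors the text above."""
    value = rpkm(read_count, gene_length_bp, total_mapped_reads)
    # Genes with zero reads have no defined log2 value and fail the cutoff.
    return value > 0 and math.log2(value) >= log2_cutoff


# Hypothetical gene: 2,560 reads, 2 kb transcript, 10 million mapped reads.
example_rpkm = rpkm(2560, 2000, 10_000_000)
```

In this hypothetical example the gene sits exactly at the cutoff, since 2,560 reads over a 2 kb transcript in a library of 10 million mapped reads gives an RPKM of 128.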

Results

Deep human proteome sequencing

In silico tryptic digestion of the ~21,030 reviewed canonical protein sequences of the human proteome (UniProtKB/Swiss-Prot) predicts 2.3 million tryptic peptides of suitable size for MS detection (7–35 amino acids, up to two missed cleavages). These peptides comprise 9.9 million amino acid residues of the 11.5 million total—that is, only 86% of the proteome. If we consider digestion of the same proteins using the six enzymes in our study (LysC, LysN, AspN, chymotrypsin, GluC and trypsin), 7.4 million peptides suitable for shotgun proteomics are generated. These peptides cover 99% of the amino acids contained in the human proteome.
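
The in silico digestion described above can be sketched in a few lines. This is an illustrative simplification, not the authors' code: the cleavage rules below are common textbook specificities (for example, trypsin cutting after K or R with no proline exception, GluC cutting after E only), and the names `RULES`, `digest` and `coverage` are hypothetical. The length window (7–35 residues) and the limit of two missed cleavages follow the text.

```python
# Simplified protease specificities (assumed for illustration).
# side "C" cuts after the listed residue; side "N" cuts before it.
RULES = {
    "trypsin": ("KR", "C"),
    "LysC": ("K", "C"),
    "LysN": ("K", "N"),
    "AspN": ("D", "N"),
    "GluC": ("E", "C"),
    "chymotrypsin": ("FWYL", "C"),
}


def cut_positions(seq, residues, side):
    """Sorted cleavage boundaries, always including both protein termini."""
    cuts = {0, len(seq)}
    for i, aa in enumerate(seq):
        if aa in residues:
            cuts.add(i + 1 if side == "C" else i)
    return sorted(cuts)


def digest(seq, enzyme, max_missed=2, min_len=7, max_len=35):
    """Yield (start, end) spans of peptides that fall in the length window."""
    residues, side = RULES[enzyme]
    cuts = cut_positions(seq, residues, side)
    for i in range(len(cuts) - 1):
        # A peptide spanning k + 1 adjacent fragments has k missed cleavages.
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            start, end = cuts[i], cuts[j]
            if min_len <= end - start <= max_len:
                yield start, end


def coverage(seq, enzymes, **kwargs):
    """Fraction of residues covered by at least one detectable peptide."""
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    for enzyme in enzymes:
        for start, end in digest(seq, enzyme, **kwargs):
            for k in range(start, end):
                covered[k] = True
    return sum(covered) / len(seq)


# Toy sequence: the two C-terminal residues form a tryptic peptide that is
# too short to detect, but a LysN peptide spans them.
toy = "AAAAAAKGGGGGGGRCC"
trypsin_only = coverage(toy, ["trypsin"], max_missed=0)
trypsin_plus_lysn = coverage(toy, ["trypsin", "LysN"], max_missed=0)
```

On the toy sequence, trypsin alone with no missed cleavages covers 15 of 17 residues, and adding LysN covers the remainder. This is the same effect, in miniature, that the six-enzyme calculation above reports for the whole proteome.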

To test the hypothesis that we can increase coverage of the human proteome in this manner, we selected six diverse human cell lines: hES1, an embryonic stem cell line; HeLa S3, from cervical carcinoma; HepG2, from liver carcinoma; GM12878, a blood lymphoblastoid line; K562, from chronic myeloid leukemia; and HUVEC, from umbilical vein endothelial cells (Fig. 1). Having been included in the Encyclopedia of DNA Elements (ENCODE) project, these cell lines have a large amount of publicly available genomic and transcriptomic data49. Proteins from each cell line were separately digested with the six proteases listed above. To maximize depth, the resultant peptides were heavily fractionated (24–80 fractions) and analyzed using nanoflow LC coupled with quadrupole-Orbitrap–linear ion trap hybrid MS systems. Dissociation for MS/MS was achieved using HCD, CAD and ETD. The resulting 2,491 raw files were simultaneously analyzed by database search to identify proteins and peptides using the Andromeda search engine50 inside MaxQuant51,52, and results were sequentially filtered to a 1% false discovery rate (FDR) at the peptide spectrum match (PSM) and protein levels over the whole dataset….
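
As a rough illustration of what filtering to a 1% FDR means at the PSM level, the sketch below applies a generic target-decoy estimate, in which the FDR at a score threshold is approximated by the number of decoy hits divided by the number of target hits above that threshold. This is an assumption made for illustration and is not a description of how MaxQuant computes FDR; the function name and the toy scores are hypothetical.

```python
def filter_at_fdr(psms, fdr_cutoff=0.01):
    """Keep target PSMs above the lowest score at which FDR <= cutoff.

    psms: iterable of (score, is_decoy) pairs, where a higher score is better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0  # number of top-ranked PSMs to accept
    for idx, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything ranked at or above this PSM.
        if targets and decoys / targets <= fdr_cutoff:
            best_cut = idx
    return [p for p in ranked[:best_cut] if not p[1]]


# Toy data: 50 target hits (scores 100 down to 51), then one decoy, then
# one more target hit.
toy_psms = [(float(s), False) for s in range(100, 50, -1)]
toy_psms += [(50.0, True), (49.0, False)]
accepted = filter_at_fdr(toy_psms)
```

In the toy data, accepting anything at or below the decoy's score would put the estimated FDR at about 2%, so only the 50 target hits ranked above the decoy are retained.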
