Molecular Medicine Israel

The Human Proteoform Project: Defining the human proteome

Abstract

Proteins are the primary effectors of function in biology, and thus, complete knowledge of their structure and properties is fundamental to deciphering function in basic and translational research. The chemical diversity of proteins is expressed in their many proteoforms, which result from combinations of genetic polymorphisms, RNA splice variants, and posttranslational modifications. This knowledge is foundational for the biological complexes and networks that control biology yet remains largely unknown. We propose here an ambitious initiative to define the human proteome, that is, to generate a definitive reference set of the proteoforms produced from the genome. Several examples of the power and importance of proteoform-level knowledge in disease-based research are presented along with a call for improved technologies in a two-pronged strategy to the Human Proteoform Project.

The Human Genome Project (HGP) was a remarkable and unqualified success profoundly transforming and accelerating biological and medical research while converting a ~$4B public investment into over $700B of economic activity and new industries (1). The challenge of revealing the “Blueprints of Life,” however, is surpassed by the challenge we face today: deriving from these blueprints an understanding of the structures they dictate and how these function within biological systems.

Proteins are primary effectors of function in biology, and thus, complete knowledge of their structure and behavior is fundamental to deciphering function in basic and translational research (2). The richness of protein structure and function goes far beyond the linear amino acid sequence dictated by the genetic code. Genetic variation, alternative splicing, and posttranslational modification (PTM) work together to create a rich variety of different proteoforms arising from our genes (Fig. 1) (3).
The chemical diversity of proteins is foundational for the biological complexes and networks that control biology yet remains largely unknown. Genome sequence alone does not provide the needed information—only direct analysis of the proteoforms themselves can reveal their composition, enabling studies of their spatial distributions and temporal dynamics in biological systems. We propose here an ambitious initiative to define the human proteome, that is, to generate a definitive set of reference proteoforms produced from the genome (see Box 1).

PROTEOFORM-LEVEL KNOWLEDGE IS ESSENTIAL TO UNDERSTAND BIOLOGICAL FUNCTION

Proteins are the central intermediaries between genotype and phenotype (24). It is not possible to understand the functioning of a biological system if one does not know what protein molecules are present, as well as the nature and abundances of their proteoforms. Knowledge of where the proteoforms are located within cells or tissues, what other proteoforms they interact with to form the multifunctional complexes that carry out critical functions in cell biology, and how they change in response to stimuli is essential. Innovative new tools are needed to comprehensively define the proteome, allowing proteoform abundances, interactors, and locations to be assessed with far greater depth at lower cost. The foundational premise of the HGP, that knowledge of the genome sequence will provide a fundamental understanding of biological systems, will not be realized in the absence of detailed proteoform-level information. This was clearly articulated by Collins et al. (2): “A critical step toward gaining a complete understanding… will be to take an accurate census of the proteins present in particular cell types. It will be a major challenge to catalog proteins present in low abundance or in membranes. Determining the absolute abundance of each protein, including all modified forms, will be an important next step.”

The Human Proteoform Project we present here is the critical next step in the quest to understand human health and disease. Several examples from five important disease areas illustrate the critical role of proteoforms in disease and health (Fig. 2). These examples show how disease-driven research has been advanced by discovery of proteoforms and their PTMs.

CENTRAL GOALS AND STRATEGY OF THE PROJECT

The primary objective of this project is to elucidate a complete set of expressed proteoforms derived from the ~20,000 genes encoded in the human genome. We put forward a two-pronged strategy. On the one hand, we pursue deep proteoform-level analysis in medically relevant systems (Fig. 2); this will continue to open up fundamental insights into targets and use cases of high biomedical importance. In parallel, we invest heavily in the accelerated development of proteoform discovery and characterization technologies and deploy them for large-scale proteoform analysis of specimens from nominally healthy donors….
