Molecular Medicine Israel

Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness

First off the COVID block

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been characterized by waves of transmission initiated by new variants replacing older ones. Given this pattern of emergence, there is an obvious need for the early detection of novel variants to prevent excess deaths. Obermeyer et al. have developed a Bayesian model to compare relative transmissibility of all viral lineages. Using this model, emerging lineages can be spotted together with the mutations that contribute toward transmissibility, not only in Spike, but also in other viral proteins. The model can prioritize lineages as they emerge for public health concern. —CA

Abstract

Repeated emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants with increased fitness underscores the value of rapid detection and characterization of new lineages. We have developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase fitness, including previously identified spike mutations and many nonspike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, ranks the fitness of lineages as new sequences become available, and prioritizes mutations of biological and public health concern for functional characterization.
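To make the phrase "multinomial logistic regression" concrete, the following is a minimal sketch of such a forward model. It is not the PyR0 implementation: it omits the hierarchical structure across geographic regions and the Bayesian priors, and it assumes that each lineage's growth rate is a simple sum of per-mutation effects. All names and numbers (`mutation_matrix`, `mutation_effects`, `intercepts`) are hypothetical toy values.

```python
import numpy as np

def lineage_prevalence(intercepts, growth_rates, times):
    """Expected share of each lineage at each time, shape (T, L).

    The logit of lineage l at time t is intercept_l + growth_rate_l * t,
    and a softmax over lineages turns logits into proportions.
    """
    logits = intercepts[None, :] + times[:, None] * growth_rates[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def multinomial_log_likelihood(counts, probs):
    """Log-likelihood of observed lineage counts, up to an additive constant."""
    return float(np.sum(counts * np.log(probs)))

# Hypothetical toy setup: 3 lineages described by 2 mutations.
mutation_matrix = np.array([[0.0, 0.0],   # ancestral lineage
                            [1.0, 0.0],   # carries mutation A
                            [1.0, 1.0]])  # carries mutations A and B
mutation_effects = np.array([0.05, 0.03])  # assumed per-day growth advantage

# Assumption: a lineage's growth rate is linear in its mutations.
growth_rates = mutation_matrix @ mutation_effects
intercepts = np.array([0.0, -3.0, -6.0])   # later lineages start rare
times = np.array([0.0, 50.0, 100.0, 150.0])

prevalence = lineage_prevalence(intercepts, growth_rates, times)
dominant = prevalence.argmax(axis=1)  # index of the most common lineage per time
```

In this toy setup the ancestral lineage dominates at first and is later replaced by the lineage carrying both mutations, which mirrors the pattern of successive variant waves described above. Fitting such a model would mean choosing `intercepts` and `mutation_effects` to maximize `multinomial_log_likelihood` on observed counts, or, in a Bayesian treatment, inferring their posterior distribution.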

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been characterized by repeated waves of cases driven by the emergence of new lineages with higher fitness, in which fitness encompasses any trait that affects the lineage’s growth, including its basic reproduction number (R0), ability to evade existing immunity, and generation time. Rapid identification of such lineages as they emerge along with accurate forecasting of their dynamics is critical for guiding outbreak response. The ability to interrogate the entirety of the global SARS-CoV-2 genomic dataset would be greatly beneficial in doing this effectively. The large size (currently >7.5 million virus genomes) and geographic and temporal variability of the available data present considerable challenges that will become greater as more viruses are sequenced. Current phylogenetic approaches are computationally inefficient on datasets with more than ~5000 samples and take days to run at that scale. Ad hoc methods to estimate the relative fitness of particular SARS-CoV-2 lineages are a computationally efficient alternative (13) but have typically relied on models in which one or two lineages of interest are compared with all others and do not capture the complex dynamics of multiple cocirculating lineages.

Furthermore, estimates of relative fitness based on lineage frequency data alone (25), which can be extended to multiple lineages (135), do not take advantage of additional statistical power that can be gained from analyzing the independent appearance and growth of the same mutation in multiple lineages. Performing a mutation-based analysis of lineage prevalence has the additional advantage of identifying specific genetic determinants of a lineage’s phenotype, which is critical both for understanding the biology of transmission and pathogenesis and for predicting the phenotype of new lineages. The SARS-CoV-2 pandemic has already been dominated by several genetic changes of functional and epidemiological importance, including the spike (S) D614G mutation associated with higher SARS-CoV-2 loads (6, 7). Mutations found in Variants of Concern (VoC), such as S:N439R, S:N501Y, and S:E484K, have been linked to increased transmissibility (8), enhanced binding to ACE2 (9), and antibody escape, respectively (10, 11). Despite these successes, identifying functionally important mutations in the context of a large background of genetic variants of little or no phenotypic consequence remains challenging….
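The statistical argument above, that the same mutation arising in several lineages lets those lineages share information, can be illustrated with a small regression sketch. This is a simplified stand-in that uses ridge regression in place of the paper's Bayesian inference, and it assumes that lineage growth rates are additive in their mutations. The function name, the `ridge` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def estimate_mutation_effects(mutation_matrix, lineage_growth_rates, ridge=1e-6):
    """Ridge-regression estimate of per-mutation growth effects.

    mutation_matrix: (L, M) array, 1 where lineage l carries mutation m.
    lineage_growth_rates: (L,) array of growth rates estimated per lineage.
    """
    X = np.asarray(mutation_matrix, dtype=float)
    y = np.asarray(lineage_growth_rates, dtype=float)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)

# Hypothetical toy data: mutation A appears in lineages 1 and 3,
# mutation B appears in lineages 2 and 3.
mutation_matrix = np.array([[0, 0],
                            [1, 0],
                            [0, 1],
                            [1, 1]], dtype=float)
true_effects = np.array([0.05, 0.02])           # assumed, for illustration only
lineage_growth = mutation_matrix @ true_effects  # noise-free toy observations

estimated_effects = estimate_mutation_effects(mutation_matrix, lineage_growth)

# Forecast for a hypothetical new lineage from its mutational profile alone.
new_lineage_profile = np.array([1.0, 1.0])
predicted_growth = float(new_lineage_profile @ estimated_effects)
```

Because every lineage carrying a mutation contributes to the estimate of that mutation's effect, a newly observed lineage with few sequences can still receive a growth forecast from the mutations it shares with better-sampled lineages.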
