Molecular Medicine Israel

Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering

Editor’s summary

Microbial biochemical systems are incredibly diverse, and computational tools for analyzing sequence data are essential for identifying new and valuable components for biotechnology development. Using an approach called deep terascale clustering, Altae-Tran et al. found 188 previously unreported functional systems linked to CRISPR, a technology for editing DNA. Some of the discovered genes are linked to precise DNA-editing systems that may enable safer therapeutic genome editing. The authors also identified a CRISPR-Cas enzyme, Cas14, which cuts RNA precisely. These discoveries may help to further improve DNA- and RNA-editing technologies, with wide-ranging applications in medicine and biotechnology. —Di Jiang

Structured Abstract


Systematic mining of sequencing databases is a powerful method for discovering protein families and functional systems. This approach has uncovered diverse CRISPR-Cas systems, which are microbial RNA–guided adaptive immune systems that have served as the basis of several molecular technologies, notably programmable genome editing. However, existing methods for sequence mining lag behind the exponentially growing databases that now contain billions of proteins, which restricts the discovery of rare protein families and associations.


We sought to comprehensively enumerate CRISPR-linked gene modules in all existing publicly available sequencing data. Recently, several previously unknown biochemical activities have been linked to programmable nucleic acid recognition by CRISPR systems, including transposition and protease activity. We reasoned that many more diverse enzymatic activities may be associated with CRISPR systems, many of which could be of low abundance in existing sequence databases.


We developed fast locality-sensitive hashing–based clustering (FLSHclust), a parallelized, deep clustering algorithm with linearithmic scaling based on locality-sensitive hashing. FLSHclust approaches MMseqs2, a gold-standard quadratic-scaling algorithm, in clustering performance. We applied FLSHclust in a sensitive CRISPR discovery pipeline and identified 188 previously unreported CRISPR-associated systems, including many rare systems.
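To make the idea of hashing-based clustering concrete, here is a minimal Python sketch. It is not the authors' FLSHclust implementation: it swaps in a textbook scheme (MinHash signatures over k-mers, banded into buckets, followed by a Jaccard check against one representative per bucket), and every function name and parameter value below is a hypothetical choice made for illustration. The point it shows is that sequences are only compared when they land in the same hash bucket, so no all-against-all comparison is needed.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Set of overlapping k-mers; a too-short sequence is its own single k-mer."""
    if len(seq) < k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, num_hashes=32, k=3):
    """One minimum per seeded hash function, taken over the sequence's k-mers."""
    ks = kmers(seq, k)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{km}".encode(), digest_size=8).digest(),
                "big",
            )
            for km in ks
        ))
    return sig


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def lsh_cluster(seqs, num_hashes=32, bands=8, k=3, min_jaccard=0.5):
    """Group sequences that share an LSH bucket and pass a Jaccard check.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rows = num_hashes // bands
    kmer_sets = [kmers(s, k) for s in seqs]

    # Bucket each sequence once per band of its signature.
    buckets = defaultdict(list)
    for idx, s in enumerate(seqs):
        sig = minhash_signature(s, num_hashes, k)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)

    # Within a bucket, verify each member against the first member only,
    # so the work stays proportional to bucket size rather than its square.
    parent = list(range(len(seqs)))
    for members in buckets.values():
        rep = members[0]
        for other in members[1:]:
            a, b = kmer_sets[rep], kmer_sets[other]
            if len(a & b) / len(a | b) >= min_jaccard:
                parent[find(parent, other)] = find(parent, rep)

    clusters = defaultdict(list)
    for idx in range(len(seqs)):
        clusters[find(parent, idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSRQ", "GGGGGGGGGGWWWWWWWWWW"]
    print(lsh_cluster(toy))
```

In this toy input the two identical sequences always share every bucket and end up together, while the third shares no k-mers with them and fails the Jaccard check even if a bucket collision occurs. Whether two merely similar sequences meet in a bucket is probabilistic and depends on the number of bands and rows, which is the usual sensitivity trade-off in banded LSH.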

We experimentally characterized four of the newly discovered systems. We examined a type IV system with an HNH nuclease domain inserted in the CRISPR-associated DNA damage-inducible gene G (DinG)–like helicase. We found that this system exhibited RNA-guided protospacer-adjacent motif (PAM)–dependent directional double-stranded DNA (dsDNA) degradation, which required both the adenosine triphosphate (ATP) hydrolysis and HNH nuclease functions of the DinG-HNH protein. This is the first demonstration of a type IV system with a specified interference mechanism. We characterized two type I systems containing HNH nuclease domains inserted in different subunits of Cascade (Cas8-HNH and Cas5-HNH). We found that both of these systems performed precise dsDNA and single-stranded DNA (ssDNA) cleavage. We additionally observed collateral cleavage of ssDNA by the Cas5-HNH system. We demonstrated that both systems can be applied for genome editing in human cells and that the Cas8-HNH system is highly specific. We also studied candidate type VII systems, which comprise a minimal Cas7-Cas5 effector complex and a distinctive interference protein containing a β-CASP domain. We showed that these systems are likely derived from type III-E CRISPR systems and are RNA targeting.

Other CRISPR-linked systems that we found include additional potential effector and adaptation components, two previously unknown associations of Mu transposons with CRISPR systems, and numerous newly identified proteins and domains associated with type V systems. We also identified an instance of potential co-option of a Cas9 as an anti-CRISPR mechanism and noted several non-CRISPR hypervariable regularly interspersed repeat arrays.


This study introduces FLSHclust as a tool to cluster billions of sequences quickly and efficiently, with broad applications in mining large sequence databases. The CRISPR-linked systems that we discovered represent an untapped trove of diverse biochemical activities linked to RNA-guided mechanisms, with great potential for development as biotechnologies.
