Data-driven early detection of potential high risk SARS-CoV-2 variants

Recent advances in deep learning have opened up new application domains in molecular biology and drug discovery, including those driven by complex processes that defy analytical modelling. However, although more data, improving compute resources and continuous algorithmic innovation are together bringing previously intractable problems into the realm of possibility, many advances have yet to make a tangible impact on life science discovery. In this talk, Dr Marcin J. Skwark will discuss the challenge of turning machine learning innovation into tangible real-world impact. Following a general introduction to the topic, as well as to newly available methods and data, he will focus on the modelling of SARS-CoV-2 variants and, in particular, the DeepChain Early Warning System (EWS) developed by InstaDeep in collaboration with BioNTech.
With thousands of new, possibly dangerous, SARS-CoV-2 variants emerging each month worldwide, it is beyond humanity's combined capacity to experimentally determine the immune evasion and transmissibility characteristics of every one. EWS builds on an experimentally tested AI-first computational biology platform to evaluate new variants in minutes, and is capable of monitoring the risk of variant lineages in near real time. It does this by combining an AI-driven protein structure prediction framework with large Transformer models focused on spike protein sequences, allowing rapid, simulation-free assessment of the immune escape risk and expected fitness of new variants, conditioned on the current state of the world. The system has been extensively validated in cooperation with BioNTech, both in terms of host cell infection propensity (including experimental assays of receptor binding affinity) and immune escape (pVNT assays with monoclonal antibodies and real-life donor sera). In these assessments, the purely unsupervised, data-first methods of EWS have shown remarkable accuracy.
EWS flags and ranks all but one of the SARS-CoV-2 Variants of Concern (Alpha, Beta, Gamma, Delta… Omicron), discriminates between subvariants (e.g. the BA.1/BA.2/BA.4 distinction) and, for most of these adverse events, allows a proactive response on the day of the observation. This permits an appropriate response on average six weeks earlier than is possible for domain experts relying on domain knowledge and epidemiological data. According to internal benchmarks, the performance of the system improves with time, which has made it possible, for example, to support decisions on the emerging Omicron subvariants in the first days of their occurrence. EWS has been noted in general media [2, 3] for its applicability to a novel problem, its ability to derive generalizable conclusions from unevenly distributed, sparse and noisy data, and its ability to deliver insights that would otherwise require long and costly experimental assays.