Infections@BDI Seminar - The use of k-mer counts to train random forests to predict country of origin for bacterial pathogen sequencing data

Two of the key successes of genomic epidemiology and routine sequencing in public health microbiology are outbreak monitoring and source attribution. Since 2014, Public Health England (PHE) have used routine whole genome sequencing on all clinical isolates of Salmonella and Shiga toxigenic Escherichia coli (STEC) for exactly those purposes. This has generated a very large body of data: to date, more than 100,000 gastrointestinal pathogens have been sequenced at PHE. Large datasets of sequenced pathogens are no longer unique, however, and many institutions and universities are generating their own large microbial genomics datasets. What is unique about this dataset is the detailed metadata and epidemiological information, drawn from real clinical cases in the UK, that is stored alongside it in PHE’s Gastro Data Warehouse. Enhanced surveillance of foodborne pathogens means that each sequenced case of STEC and Salmonella also has recorded information on the region of the case, any recent foreign travel by the patient, and the virulence factors associated with the strain. Here we present work that uses this wealth of metadata, together with the publicly available sequencing data, to train random forest algorithms on k-mer count data so that, for newly generated sequences, they predict the country from which the strain is likely to have originated. This will enable a fully automated method that reproduces source attribution and also monitors international outbreaks. It builds on the two successes of genomic epidemiology as a natural progression towards their automation, with the potential to democratize genomic epidemiology by making it usable by practitioners who are not trained in phylogenetics or bioinformatics.
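
As a rough illustration of the general idea, the sketch below turns DNA sequences into fixed-length k-mer count vectors and passes them to a random forest classifier. It is not the pipeline used in the work described here. The function names, the choice of k = 4, the handling of ambiguous bases, and the use of scikit-learn's RandomForestClassifier are all assumptions made for the example, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product


def kmer_vocabulary(k):
    """Return all DNA k-mers over A, C, G, T in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping windows with non-ACGT bases."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_matrix(sequences, k):
    """Build one row of k-mer counts per sequence, columns in vocabulary order."""
    vocab = kmer_vocabulary(k)
    rows = []
    for seq in sequences:
        counts = kmer_counts(seq, k)
        rows.append([counts[kmer] for kmer in vocab])
    return rows


def train_country_forest(sequences, countries, k=4):
    """Hypothetical helper: fit a random forest on k-mer counts.

    Assumes scikit-learn is installed; it is imported here so that the
    featurisation functions above work without it. k=4 and n_estimators=100
    are arbitrary illustrative values, not settings from the study.
    """
    from sklearn.ensemble import RandomForestClassifier

    X = feature_matrix(sequences, k)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, countries)
    return model
```

Given a list of sequences and a matching list of country labels taken from travel metadata, `train_country_forest(sequences, countries)` would return a fitted model whose `predict` method accepts rows produced by `feature_matrix` with the same k. With genome-scale data, a dedicated k-mer counter and a sparse matrix would likely be needed in place of these pure-Python loops.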