Papers, please: host attribution of Salmonella Typhimurium via machine learning
Posted on November 7, 2023 by Antonia Chalka
Antonia Chalka takes us behind the scenes of their latest publication 'The advantage of intergenic regions as genomic features for machine-learning-based host attribution of Salmonella Typhimurium from the USA', published in Microbial Genomics.
I am Antonia Chalka, a PhD student in Professor David Gally's group at the Roslin Institute of the University of Edinburgh, Scotland. My research applies machine learning to bacterial genome sequences in order to predict the host animal species that the bacteria have come from, a task known as host or source attribution. In this paper I applied this approach to Salmonella, an important cause of foodborne illness in humans and animals. Such attribution has been a long-standing challenge in microbiology: defining the likely source of an outbreak strain (chicken, cattle or pigs, for example) can help control outbreaks and improve food safety.
This task is complicated by the genetic diversity within the Salmonella genus, which makes host attribution based solely on phylogeny challenging. Traditional methods have relied primarily on such phylogeny, represented through multi-locus sequence typing (MLST) or whole-genome single nucleotide polymorphisms (SNPs), to identify the closest isolates and attribute the host based on that relationship. These approaches have their limitations, however, and recent advances in machine learning have opened new doors to improving the accuracy of host attribution from genome data.
Much ado has been made about machine learning, especially in the past few years, with the proliferation of algorithms and frameworks such as deep learning and, more recently, large language models like ChatGPT. Our research uses the more humble 'random forest' approach, which combines multiple decision trees to make predictions. When dealing with a complex trait like host specificity, relying solely on association analysis is insufficient to paint the full picture. Machine learning allows collections of features to be considered alongside their relationships to each other (e.g. in the form of a decision tree), allowing us to make predictions based on a more comprehensive set of characteristics, rather than just the closest genetic relatives.
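To give a flavour of the approach, here is a minimal sketch of a random forest trained on binary presence/absence features to predict a host label. The data, feature count and host rule are all invented for illustration; the models in the paper are trained on thousands of real genomic features per isolate, not this toy matrix.

```python
# Illustrative sketch only: a random forest on toy presence/absence
# features. Feature values and the host-assignment rule are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# 60 hypothetical isolates x 8 hypothetical binary features
# (1 = feature present in that isolate's genome)
X = rng.integers(0, 2, size=(60, 8))

# Pretend the host depends on the first two features co-occurring
y = np.where(X[:, 0] & X[:, 1], "cattle", "chicken")

# Each decision tree votes; the forest aggregates the votes
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

pred = model.predict([[1, 1, 0, 0, 1, 0, 1, 0]])
print("Predicted host:", pred[0])

# Feature importances hint at which features drive the decision
print("Importances:", model.feature_importances_.round(2))
```

Because the forest considers combinations of features at once, it can capture joint patterns that a one-feature-at-a-time association analysis would miss.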
Our laboratory has a long-standing interest in studying host attribution using machine learning. Our recent paper delves into the attribution of hosts to USA Salmonella Typhimurium isolates, building upon Dr Lupolova's previous work. We have now employed a larger dataset and a more sophisticated approach, streamlining the process into a reusable pipeline that can be applied to similar problems in the future, such as host attribution in different pathogens or exploring bacteriophage specificity. Additionally, we aimed to identify the most effective genomic characteristics to use as features for building host attribution models, comparing models built on antimicrobial resistance (AMR) determinants, pangenome gene clusters (PV), intergenic regions (IGR) and SNPs.
Our results were encouraging: our models were either on a par with, or outperformed, traditional nearest-neighbour models. Specifically, models based on SNP data proved as effective as the traditional approach, whereas models trained on PV or IGR data demonstrated superior performance over phylogeny-based assignment.
Nevertheless, our machine learning models, like traditional methods, are constrained by the population structure on which they were trained. If an isolate falls outside the clades present in the training set, the predictions may become unreliable. Although we consider that our methodology enables more accurate host assignment than traditional phylogeny-based techniques, our aim is for machine-learning-based host attribution to complement, not replace, existing approaches.
In addition to our results, we hoped to identify genes and regulatory regions responsible for host specificity by extracting them from the features deemed important during model training. This, however, proved to be a challenging and potentially misguided endeavour, given the complex web of phylogeny and genetic interactions at play.
Creating the initial models and setting up the model training pipeline presented its own set of challenges. Anyone working in bioinformatics knows that a significant part of the job is wrangling file formats and making various components work seamlessly. Automating this process was initially daunting, given the numerous tools and components that needed to come together. We built our pipeline with Nextflow alongside Docker, and it is available on GitHub. Version 1.0 was used for the paper, and we are in the process of improving it with additional features and workflows.
Our long-term goal is to expand our approach and integrate it into diagnostics and outbreak investigations alongside existing techniques. Machine learning has the potential to be an invaluable tool for interrogating bacterial genomic data, and we aim to make a set of models publicly available for testing sequences. The idea is for any interested party to drop a sequence into a website and be told, "Oh, this probably came from a chicken (85% confidence)!" In addition, the approach can assign a human infection 'risk' score to any livestock Salmonella isolate, but of course this score is very hard to verify – any volunteers…?
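The kind of confidence read-out described above falls naturally out of a random forest: the fraction of trees voting for each host can be reported as a per-class confidence. The sketch below shows this with invented data and hosts; it is not the published models or their real feature sets.

```python
# Illustrative sketch only: turning random forest votes into a
# per-host confidence score. Data and the host rule are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# 80 hypothetical isolates x 10 hypothetical binary features
X = rng.integers(0, 2, size=(80, 10))
y = np.where(X[:, 0] == 1, "chicken", "pig")

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# predict_proba returns the fraction of trees voting for each class,
# which can be surfaced to a user as "85% confidence"-style output
probs = model.predict_proba(X[:1])[0]
for host, p in zip(model.classes_, probs):
    print(f"{host}: {p:.0%} confidence")
```

In a web front end like the one described, the class with the highest vote fraction would be reported as the attributed host, with its fraction as the headline confidence.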