bacterial genome annotation - handling and taking advantage of a multi-million protein sequence space
posted on march 25, 2022 by oliver schwengers
oliver schwengers, phd student at the justus liebig university in giessen, takes us behind the scenes of his latest research 'bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification' published in microbial genomics. oliver discusses the analysis of microbial genomics data, the resulting challenges and new approaches for the annotation of bacterial genomes for applications within the fields of microbial ecology and medical microbiology.
i’m working in the computationalbio lab supervised by professor alexander goesmann at the justus liebig university in giessen. our group has long-standing experience in the analysis of microbial omics data and genome sequences, with collaborations within the fields of microbial ecology, biotechnology and medical microbiology.
due to the progress in dna sequencing technologies, bacterial genome sequencing is no longer a big deal – some would even say routine. but to uncover hidden information and answer scientific questions, this data must be properly analyzed. one example is antimicrobial-resistant pathogens that are routinely sampled in various settings. for this purpose, i developed an automated software pipeline for the characterization and analysis of bacterial genomes using wgs data as part of my phd. in this context and for many other post-assembly analyses, a comprehensive and thorough genome annotation plays a crucial role as a common starting ground, and various great software tools exist addressing different use cases with distinct combinations of features, requirements and limitations. however, new challenges emerge from the astounding daily influx of new genomes and steep growth of public databases.
in microbial genomics, we’re facing two diametrically diverging developments. on the one hand, public databases are flooded with highly similar and near-identical protein sequences – including those of utmost relevance, such as amr genes and virulence factors. excitingly, we gain deeper knowledge of them by exploring even the tiniest variations down to single amino acids. however, to cross-link and annotate protein sequences with valuable information from public databases, they must be identified exactly among hundreds of millions of sequences – computationally a very demanding task. on the other hand, researchers worldwide collect and sequence genomes of hitherto unknown species of the so-called microbial dark matter and thus constantly reveal new sequences. hence, for the comprehensive annotation of bacterial genomes, one needs to properly handle both extremes: the exact identification of known sequences, and the functional description of rare or even unknown sequences – both in the order of hundreds of millions.
to address the latter, we follow a classical alignment-based approach using uniprot’s uniref protein clusters comprising millions of sequences that are regularly clustered at different sequence identities, allowing for searches at various homology levels. we compiled a database of cluster representatives clustered at 90% and 50% sequence identity, which at the time of writing included more than 91 million and 12 million sequences, respectively. then, we annotated them using external high-quality resources like pfam, refseq, cog, and assigned public database accessions to foster fair principles. however, aligning thousands of proteins of a typical bacterial genome against millions of reference proteins remains a heavy computational task that is not feasible in a timely manner without larger computational resources.
accelerating these searches and in particular addressing the first extreme, however, is much harder. we somehow need to reduce this giant sequence search space containing hundreds of millions of protein sequences – many of them nearly identical – while still being able to exactly identify each. for this purpose, we tested so-called hash functions that map input data of arbitrary lengths to fixed-size binary fingerprints which are very fast to compute – much faster than sequence alignments. by testing different functions, we found that even the fairly simple md5 achieves exact sequence identification while reducing overall storage requirements. to take advantage of this, we created a compact, local database with fingerprints of more than 200 million sequences. just like for protein clusters, we assigned annotations and added cross-links to external databases and related clusters. thereby, we achieve exact sequence identifications and fast lookups of related information from this database at storage requirements reduced to ⅓ even though it includes rich annotations like gene symbols, ec numbers, go terms, protein products and external database accessions. interestingly enough, this alignment-free approach also avoids a substantial share of computationally expensive alignments and thus mitigates negative runtime effects of the aforementioned huge protein cluster database.
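to make the idea of hash-based identification more concrete, here is a minimal sketch in python. it is not bakta's actual code: the sequences, gene names and the in-memory dictionary are hypothetical stand-ins for the real fingerprint database, and storing the sequence length next to the digest as a cheap guard against collisions is an assumption made for this illustration. the sketch only shows the principle that a fixed-size md5 digest can replace the full sequence as a lookup key, and that only proteins without an exact hit need to go on to an alignment-based search.

```python
import hashlib


def fingerprint(seq: str) -> bytes:
    """Return the 16-byte MD5 digest of an amino acid sequence."""
    return hashlib.md5(seq.encode("ascii")).digest()


def build_reference(records):
    """Map digest -> (sequence length, annotation) for known proteins.

    `records` is an iterable of (sequence, annotation) pairs. Only the
    fixed-size digest is kept as the key, not the sequence itself.
    """
    return {fingerprint(seq): (len(seq), annotation) for seq, annotation in records}


def identify(seq, reference):
    """Return the annotation of an exactly matching known sequence, else None."""
    hit = reference.get(fingerprint(seq))
    if hit is None:
        return None
    length, annotation = hit
    # Assumption for this sketch: compare lengths as a cheap collision guard.
    return annotation if length == len(seq) else None


def annotate(proteins, reference):
    """Split proteins into exactly identified ones and leftovers.

    The leftovers are the only sequences that would still need a
    (much slower) alignment-based search against cluster representatives.
    """
    identified, unresolved = {}, []
    for name, seq in proteins.items():
        annotation = identify(seq, reference)
        if annotation is None:
            unresolved.append(name)
        else:
            identified[name] = annotation
    return identified, unresolved


# Hypothetical toy data: two known proteins differing in one amino acid.
reference = build_reference([
    ("MKTAYIAKQR", {"gene": "exaA", "product": "example protein, variant A"}),
    ("MKTAYIAKQK", {"gene": "exaB", "product": "example protein, variant B"}),
])

genome_proteins = {
    "cds_1": "MKTAYIAKQK",  # exact match to variant B
    "cds_2": "MKTAYIAKQQ",  # unknown variant, not in the reference
}

identified, unresolved = annotate(genome_proteins, reference)
print(identified)
print(unresolved)
```

in this toy example, the two reference sequences differ in a single amino acid and still receive distinct fingerprints, so `cds_1` is resolved by a dictionary lookup alone, while `cds_2` ends up in the list of sequences left for alignment.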
with access to more data and information than ever before in the history of microbiology, we live in exciting and fascinating times. as a microbial bioinformatician, i love to work at the intersection between microbiology and engineering to provide methods and software tools that hopefully help other researchers to study and take advantage of this giant microbial data.
the annotation of protein sequences and many other genome features is implemented in the open-source command-line annotation tool bakta, recently published in microbial genomics. for any questions about this research project, please contact oliver schwengers at [email protected].