from custom pipeline to scalable software: assembling complete bacterial genomes from long and hybrid reads with hybracter
posted on may 13, 2024 by george bouras
george bouras takes us behind the scenes of their latest publication 'hybracter: enabling scalable, automated, complete and accurate bacterial genome assemblies' published in microbial genomics.
i am george bouras, a bioinformatician and phd student in the ent surgery group at the university of adelaide and the basil hetzel institute for translational health research in adelaide, australia. i also work closely with the flinders accelerator for microbiome research at flinders university, also in adelaide. i work on bioinformatics tool development for phage and bacterial genome assembly, polishing and gene annotation. i also apply these tools to analyse wet-lab data generated by my colleagues in areas such as phage therapy, the microbiomes of chronic rhinosinusitis and head and neck cancer, and viral metagenomics.
hybracter started because i had to assemble a few hundred staphylococcus aureus genomes, sequenced with both nanopore long reads and illumina short reads, that my co-author ghais houtak had generated from chronic rhinosinusitis patients (some now published as part of our study in microbial genomics). to do this, as everyone conducting bacterial genome assembly should, i read the entire back catalogue of publications, blog posts and tutorials by my co-author ryan wick, who is without doubt the world’s foremost expert in the field of bacterial genome assembly. at the time, ryan was recommending a switch to a long-read-first approach for hybrid assemblies: assemble with long reads, then polish with short reads. this is in contrast to the prevailing gold standard, his tool unicycler, which conducts a short-read assembly and then uses long reads to resolve repeats.
however, no one had made an automated long-read-first assembly tool or pipeline that also considered plasmids. the problem, i quickly found out, was the one summarised in the title of a paper published last year in microbial genomics by johnson et al.: ‘long-read assemblers struggle to assemble small plasmids’. so you had to run a short-read-first assembly with unicycler anyway, which was slow and unnecessarily assembled the chromosome too. and staphylococcus aureus has a lot of small plasmids!
accordingly, i developed hybracter as a pipeline that separates the assembly of plasmids from that of the chromosome: the plasmids are assembled using plassembler, a tool i developed for fast targeted plasmid assembly, while the chromosome is assembled with flye, a fast and accurate long-read assembler. crucially, i managed to figure out, with ryan’s help, that small plasmids can be accurately recovered even from long reads alone if plassembler (which uses unicycler under the hood) is tricked into treating the long reads as both the short-read and the long-read input. this “solved” the small plasmid issue, and it was a very happy day in the lab! i also built a few other tools and modules, such as dnaapler (for reorienting the chromosome and plasmids) and pypolca, a modern and improved re-implementation of the short-read polisher polca, which ryan recommended in his masterful tutorial.
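to make the idea of separating chromosome and plasmid assembly concrete, here is a minimal python sketch. it is not hybracter's actual code: it simply assumes that contigs at or above a user-chosen minimum chromosome length are treated as chromosome, and everything shorter is left as a plasmid candidate for a dedicated plasmid assembler. the function name `split_contigs`, the threshold value and the contig names are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple


def split_contigs(
    contig_lengths: Dict[str, int], min_chromosome_length: int
) -> Tuple[List[str], List[str]]:
    """Split contigs into chromosome and plasmid candidates by length.

    Assumption for illustration: a simple length cutoff is enough to tell
    a chromosome apart from everything else. Input order is preserved.
    """
    chromosome: List[str] = []
    plasmid_candidates: List[str] = []
    for name, length in contig_lengths.items():
        if length >= min_chromosome_length:
            chromosome.append(name)
        else:
            plasmid_candidates.append(name)
    return chromosome, plasmid_candidates


# Hypothetical contig lengths (in base pairs) from a long-read assembly.
example = {"contig_1": 2_800_000, "contig_2": 27_000, "contig_3": 2_500}
chromosome, plasmid_candidates = split_contigs(
    example, min_chromosome_length=2_000_000
)
print(chromosome)
print(plasmid_candidates)
```

in a sketch like this, the chromosome contigs would go on to polishing and reorientation, while the plasmid candidates would be handed to a separate, plasmid-focused assembly step.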
hybracter began as a local snakemake pipeline for my own use (this earlier iteration is described in the supplements and methods of our crs paper), but it was clear (with some encouragement from ghais) that if i put more time and effort into software development, maintenance and robustness, it could be widely useful for the microbial genomics community. a really crucial moment here was when my co-author vijini mallawaarachchi gave me a two-hour master-class on software development and testing. i quickly adopted many of the best practices she taught me into hybracter and my other software tools. hybracter therefore quickly graduated from a local pipeline into a fully-fledged software package, with support for long-read only assemblies.
going forward, i hope hybracter becomes a useful tool for the community, particularly as we enter an age of cheap and near-perfect long reads. under the hood, hybracter uses the snakemake workflow manager to scale efficiently to the massive long-read isolate datasets that are increasingly being generated, while wrapping it in the kind of simple command-line interface that most users of bioinformatics software are used to. we have also made it easy for bioinformatics beginners to use, with options like no-code google colab notebooks.
i think hybracter will be particularly useful for antimicrobial resistance gene and plasmid research, as it will enable a large expansion in the number of accurate and complete plasmid sequences. i really think this area will continue to explode as long-read sequencing improves and becomes cheaper. inferring plasmid transmission and epidemiology is much easier (though still not easy!) if you can just get the whole plasmid sequence, rather than making inferences from short reads alone.
eventually, i actually hope hybracter becomes redundant. this will probably happen when long reads are essentially perfect and nearly costless to generate (at least at the bacterial genome scale). we will then be able to apply much simpler algorithms to automatically generate perfect assemblies at scale. the problem will then move on to consistently generating complete genome assemblies from metagenomic data, but one step at a time!