COVID-19: a new era for microbial genomics
Issue: SARS-CoV-2 and COVID-19
12 October 2021 | Article
"the covid-19 pandemic has changed the world …” is probably how 90% of the articles about the topic have started since sars-cov-2 burst so unceremoniously onto the scene in late 2019. you will hear no disagreement from me in that regard, but we do need to discuss a significant change that has been heralded by the pandemic, and that is in the field of bioinformatics.
Even very early on in the pandemic, it was clear genomics was going to have a big role to play, not only because it was an available technology but also because of its cost effectiveness. Make no mistake, sequencing has been around for a while now, and cost-effective pathogen sequencing certainly isn’t novel by any standards. It is the fact that the skills, knowledge and infrastructure are far more ubiquitous now than they were in the previous decade. This, I believe, is what made all the difference. The paper describing the genome of SARS-CoV-2 was received by Nature on 7 January 2020 and published shortly thereafter, on 3 February. On the face of it, most scientists reading this will be aghast at how quick a publication this was, but there is a far more astounding story within it.
The patient from whom this novel coronavirus genome was elucidated was admitted to hospital on 26 December 2019. Within 12 days of admission, the genome of the novel coronavirus responsible for causing the patient’s disease had been assembled from a metagenomic sequencing run of a bronchoalveolar lavage, and a paper describing the genome, illustrating where it sat in the evolutionary framework of other coronaviruses and identifying its possible origin as a spillover event from bats, had been written and submitted to Nature. This was an early marker that genomics and bioinformatics were going to play a huge role in this pandemic.
By March 2020, SARS-CoV-2 had spread far across the globe and many countries had entered lockdowns of some form. On 11 March, the World Health Organization (WHO) Director-General declared that COVID-19 was indeed a pandemic. The rapid spread of the virus was met with the swift mobilisation of scientists in response. Thousands of scientists committed themselves to the fight against SARS-CoV-2, whether as part of the labour force processing PCR tests or by changing their research focus. Importantly, as this article is about bioinformatics, many national consortia sprang up dedicated to one goal: to sequence positive COVID-19 samples. The most notable of these emerged in the UK: the COVID-19 Genomics UK Consortium, or COG-UK for short.
COG-UK was, and continues to be, a collaborative effort unlike any other, with a plethora of scientists working across 22 universities, institutes and health agencies to deliver on a singular goal: sequence, analyse and report on the genomes of SARS-CoV-2 samples in the UK. At the time of writing, COG-UK has sequenced more than 800,000 SARS-CoV-2 genomes, meaning the UK has sequenced more genomes per head of population than any other country on the planet. More important, and what for me represents the true cultural shift in this field, is the rapid public archival of these data for use by the wider scientific community.
Before I continue, I should be clear that I currently work as a bioinformatician with the European Nucleotide Archive (ENA) at EMBL’s European Bioinformatics Institute (EMBL-EBI), as part of the team responsible for the archival and mobilisation of SARS-CoV-2-related data. I can offer my perspective as someone working with this immense volume of data and give a professional view as to why I feel this pandemic has marked such a major cultural shift for the microbial genomics and bioinformatics communities.
There are currently two primary routes for a scientist to archive SARS-CoV-2 data: GISAID and the International Nucleotide Sequence Database Collaboration (INSDC), which consists of the European Nucleotide Archive (ENA), the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ). The latter will be well known to many scientists, as the INSDC has been archiving digital biological data for many years. GISAID is a newer service that initially focused on the rapid sharing of influenza data but adapted to also take in SARS-CoV-2 genomes when the pandemic struck. Initially, the genomic consortia responsible for sequencing the virus submitted primarily to GISAID, mainly because of their strong desire to get genomes available to the wider community as fast as possible, which is a key advantage of GISAID as a service. This enabled many decisions early in the pandemic to be informed by genomics and helped give a sense of how the virus spread across the globe. The rapid sharing of genomes also allowed the rapid development of tools for genome analysis, such as the Pangolin lineage assignment tool and CoV-GLUE for exploring amino acid mutations, and genomes could be picked up and analysed by services like Nextstrain. All of these tools are now ubiquitous in the fight against the pandemic, all powered and enabled by the rapid sharing of genomes.
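As a small, concrete illustration of the kind of analysis this rapid sharing enabled, the sketch below runs a set of assembled consensus genomes through the Pangolin command-line tool and reads back the resulting lineage report in Python. It assumes Pangolin is installed locally; the input file name is a placeholder, and flag names and report columns can vary between Pangolin versions.

```python
import csv
import subprocess

# Minimal sketch: assign Pango lineages to assembled consensus genomes.
# Assumes the pangolin tool (cov-lineages.org) is installed and on PATH;
# "consensus_genomes.fasta" is a placeholder input file.
subprocess.run(
    ["pangolin", "consensus_genomes.fasta", "--outfile", "lineage_report.csv"],
    check=True,
)

# The report contains one row per input sequence with its assigned lineage.
with open("lineage_report.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        print(row["taxon"], row["lineage"])
```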
However, GISAID currently only archives assembled genome sequences, which gives limited options for researchers looking to do more in-depth analyses of the SARS-CoV-2 data, or for those who would like to control how those genomes were constructed from the raw data. This is where the INSDC is of enormous importance to researchers, as you can also archive raw read data and sequences, along with a myriad of other data types.
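To give a sense of how that raw read data can be explored programmatically, here is a minimal sketch using the ENA Portal API’s search endpoint and the SARS-CoV-2 taxon ID (2697049). The particular fields and record limit are illustrative choices rather than a recommended query, and the results will of course change as new data are archived.

```python
import requests

# Minimal sketch: list a handful of public SARS-CoV-2 raw read records
# from the ENA Portal API. Field selection and limit are illustrative.
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

params = {
    "result": "read_run",          # raw sequencing runs rather than assemblies
    "query": "tax_tree(2697049)",  # SARS-CoV-2 and any descendant taxa
    "fields": "run_accession,collection_date,country,instrument_platform,fastq_ftp",
    "format": "tsv",
    "limit": 10,
}

response = requests.get(ENA_SEARCH, params=params, timeout=60)
response.raise_for_status()

# The first line is the header; the rest are tab-separated records.
for line in response.text.splitlines():
    print(line)
```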
For context, at the time of writing (August 2021), SARS-CoV-2 reads make up 7.7% of all the raw reads archived within the ENA (by accession, not in terms of storage), which I think illustrates just how enormous the sequencing efforts of the UK and the rest of the world have been. So important is this flow of data that a consortium of European partners, including EMBL-EBI, ELIXIR, Erasmus MC, ELTE and DTU, has a dedicated service for exploring SARS-CoV-2 data: the European COVID-19 Data Platform. Even more astonishing is the fact that the median turnaround time from sample collection date to the raw data going public in the ENA has been, at its peak, just 25 days. There is an extraordinary amount of manpower involved in this feat, one that we should be celebrating.
The advantage of archiving these data, especially with the INSDC, is that they will be archived to provide the maximum utility to any scientist wishing to repurpose them, for as long as possible. However, in order to provide that maximum utility, the metadata surrounding each of the samples needs to be as rich as possible. At a bare minimum, collection date, species and geography are the metadata of utility in public health genomics, but much of the data captures more than this. Ct values from qRT-PCR runs, or the number of days hosts were symptomatic, are just two examples of the kind of detailed metadata I’m talking about. This, for me, is where we are witnessing the true culture shift in public health genomics. Yes, the volume of data being produced is a feat to behold, but there is also great depth to the metadata, and this is what makes the public data so useful. Not just now, but long into the future, scientists will be able to analyse these data in detail, and we, as a society, will be learning from this pandemic for years to come, all thanks to the bioinformaticians and scientists across the globe who have been committed to sequencing, archiving and making all of these data available.
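To make that a little more tangible, here is a purely illustrative sketch of the kind of rich sample record described above; the field names are hypothetical stand-ins rather than the exact terms used in ENA sample checklists.

```python
# Purely illustrative sample metadata record; field names are hypothetical
# stand-ins, not the exact ENA checklist terms.
sample_metadata = {
    "sample_accession": "SAMEA0000000",       # placeholder accession
    "collection_date": "2021-03-15",          # bare-minimum public health fields
    "geographic_location": "United Kingdom",
    "host_scientific_name": "Homo sapiens",
    "qrt_pcr_ct_value": 22.4,                 # richer detail: diagnostic Ct value
    "days_symptomatic": 5,                    # and how long the host was symptomatic
}

# Richer metadata enables finer-grained reuse, e.g. selecting samples by a
# viral-load proxy (Ct value) before any downstream comparison.
if sample_metadata["qrt_pcr_ct_value"] < 25:
    print(sample_metadata["sample_accession"], "passes the Ct < 25 filter")
```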
This is the unsung cultural change in the science of the pandemic. We are heralding a new era of open science. We know we can sequence pathogens at scale. We know we can archive and contextualise genomic data robustly and quickly. Now we just have to keep doing it, because in this new era of open science, robust, public and accessible digital biological data will be the key to fighting infectious diseases.
Further reading
International Nucleotide Sequence Database Collaboration
GISAID
European Nucleotide Archive
National Center for Biotechnology Information
DNA Data Bank of Japan
Cov-lineages
CoV-GLUE
Nextstrain
Colman O’Cathail
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
@colmanoc
uk.linkedin.com/in/colmanocathail
Colman currently works as a bioinformatician at the ENA dealing with SARS-CoV-2 data. He has been a member of the Society since 2014 and is currently Chair of the ECM Forum Executive Committee and a member of Council.
Could you describe one of your typical workdays?
‘Typical’ feels so unjust a word to use, as I think my workdays can be very varied. But broadly speaking, my workday would include an internal team meeting or two relating to data archiving or analysis at the ENA. I would also usually meet with our international partners across Europe or even further afield, which is one of my favourite parts of working at EMBL-EBI! In between, I’ll be working on helping users submit data, delivering support for ongoing SARS-CoV-2 work and collaborating on coding projects within the ENA team.
Which parts of your job do you find most challenging?
I think the most challenging part of my job is being stumped by coding problems, although that incidentally makes it one of the more satisfying parts of my job too! It’s hard to beat the feeling of solving a problem that’s stumped you for a while.
Image: Martin Krzywinski/Science Photo Library. Data visualisation of the genomes of the 56 fully sequenced isolates of the virus SARS-CoV-2.