there is no bad data, there is only data

posted on september 13, 2024 by dr. inês mendes and dr. bryan a. wee

dr. inês mendes and dr. bryan a. wee take us behind the scenes of their latest publication 'pha4ge quality control contextual data tags: standardized annotations for sharing public health sequence datasets with known quality issues to facilitate testing and training' published in microbial genomics.


inês and bryan are members of the public health alliance for genomic epidemiology (pha4ge), a global coalition of public health, academic and industry professionals who aim to document and develop best practices in using genomic sequence data for public health epidemiology.

dna sequencing is becoming increasingly accessible, both through platforms that require a lower initial investment to get started and through platforms that generate higher throughput, allowing for lower per-sample costs. nevertheless, substantial time and money are still required to generate sequence data, and the data produced are not always good enough to perform all types of analyses. this can be compounded by the absence of global standards that determine what constitutes ‘good’ data, as these thresholds are often set within organizations based on their own specific needs.

even when a sequencing run is less than ideal, sharing these suboptimal datasets can still aid fields such as public health. for example, whole genome sequences of sars-cov-2 that have insufficient coverage of the whole genome can still yield valuable information about the spike protein. bacterial genomes that were not sequenced deeply enough for phylogenetic analyses could still be used to gain some understanding of their antimicrobial resistance phenotypes. these data can be useful to the community, especially if accompanied by high-quality metadata. suboptimal data can also be used for training purposes and to test bioinformatic tools. this can help us understand the range of scenarios that occur with real-world data and that cannot easily be replicated by data simulated in silico.

in our recent microbial genomics publication (june 2024) entitled “pha4ge quality control contextual data tags: standardized annotations for sharing public health sequence datasets with known quality issues to facilitate testing and training” [1], we show how these so-called ‘bad data’ can be made more valuable by tagging them with information about their potential issues. our proposed annotations, called qc tags, provide a systematic way of labelling data with known quality issues, so that data generators can communicate those issues to data users. qc tags also allow users to search for and filter datasets with quality issues, including or excluding them from their workflows as desired. our approach leverages ontologies to allow for consistent descriptions of the issues that can plague sequence data, aiding reproducibility and clear understanding by data users.
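to make the filtering idea concrete, here is a minimal python sketch of how a data user might include or exclude tagged datasets. everything in it is an assumption made for illustration: the `SequenceRecord` class, the `split_by_qc_tags` helper, the accessions and the tag names (`low_coverage`, `contamination`) are hypothetical and are not the ontology terms or field names defined in the publication.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class SequenceRecord:
    """A hypothetical dataset entry: an accession plus its QC tags."""
    accession: str
    qc_tags: Set[str] = field(default_factory=set)


def split_by_qc_tags(
    records: Iterable[SequenceRecord],
    excluded_tags: Iterable[str],
) -> Tuple[List[SequenceRecord], List[SequenceRecord]]:
    """Separate records into (kept, flagged).

    A record is flagged when it carries at least one of the excluded tags;
    all other records, including untagged ones, are kept.
    """
    excluded = set(excluded_tags)
    kept: List[SequenceRecord] = []
    flagged: List[SequenceRecord] = []
    for record in records:
        if record.qc_tags & excluded:
            flagged.append(record)
        else:
            kept.append(record)
    return kept, flagged


if __name__ == "__main__":
    # Placeholder accessions and tag names, not real identifiers.
    records = [
        SequenceRecord("SAMPLE_A"),
        SequenceRecord("SAMPLE_B", {"low_coverage"}),
        SequenceRecord("SAMPLE_C", {"contamination", "low_coverage"}),
    ]
    kept, flagged = split_by_qc_tags(records, {"contamination"})
    print("kept:", [r.accession for r in kept])
    print("flagged:", [r.accession for r in flagged])
```

the same split serves both use cases described above: a surveillance workflow would carry on with the kept records, while someone building a training or tool-validation set could deliberately collect the flagged ones.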

figure: the pathogen data object model (dom) illustrates the key elements of a comprehensive pathogen package. this package primarily includes raw sequence files along with their related contextual information housed in biosample, bioproject, and the raw data archive. the package also contains assemblies and/or consensus records, which the submitter may provide or which may be automatically generated by the insdc repository, depending on the specific organism. adapted from: timme et al. 2022, microbial genomics. © timme et al. 2022.

when uploading sequencing data to publicly available data repositories such as those of the international nucleotide sequence database collaboration (insdc), which include the sequence read archive (sra), biosample and genbank, qc tags can be added where appropriate. however, when the organization of these data is not standardized, it can be difficult to know where this information can be found, which limits the utility of data in public sequence repositories. the pathogen data object model, proposed in the microbial genomics manuscript published in december 2023 entitled “putting everything in its place: using the insdc compliant pathogen data object model to better structure genomic data submitted for public health applications” [2], addresses this by giving sequence data and their contextual information a common structure. this allows the data to be used not just for research purposes, but also for surveillance and public health decision-making, by inferring epidemiologically relevant events based on robust contextual metadata. this common pathogen data structure formalizes the minimum pieces of both sequence and contextual data necessary for actionability.

there is power in data, but a lot of that power comes from being able to place it into the right context. the purpose of sequencing, the sample source and condition, and a clear description of the methods applied can make data more useful, regardless of its quality. so go ahead and share your data, even if the sequencing run fell short of the ideal. with proper annotation, someone’s bad-quality data might represent an invaluable training opportunity or a tool for the validation of new methods.


references

1. griffiths ej, mendes i, maguire f, guthrie jl, wee ba, schmedes s, et al. pha4ge quality control contextual data tags: standardized annotations for sharing public health sequence datasets with known quality issues to facilitate testing and training. microbial genomics. 2024; doi: 10.1099/mgen.0.001260.

2. timme re, karsch-mizrachi i, waheed z, arita m, maccannell d, maguire f, et al. putting everything in its place: using the insdc compliant pathogen data object model to better structure genomic data submitted for public health applications. microbial genomics. 2023; doi: 10.1099/mgen.0.001145.