A new algorithm developed and tested by researchers at New England Biolabs (NEB®), and reported today in Science, suggests this problem is widespread among DNA sequencing studies. The researchers estimated that 41 percent of the sequencing runs from the 1000 Genomes Project and 73 percent of the runs examined from The Cancer Genome Atlas show signs of DNA damage introduced during library preparation. More specifically, in those runs one-third or more of the G-to-T mutations in reads may be false positives. Other variant calls are affected too, but less often.
"The extent of the damage in public databases surprised us. While this is very unlikely to affect the conclusions from those initiatives at the population level, these artifacts may limit the ability to assess tumor heterogeneity and the drivers of relapse in individual cancer patients. Understanding this damage offers the research community an opportunity to enhance diagnostics and accelerate cancer research," said Laurence Ettwiller, Ph.D., a research scientist at New England Biolabs and co-corresponding author on the study.
"The potential upside is that by removing these artifacts, we have the opportunity to lower the threshold for detecting bona fide somatic variants. Driver mutations among small tumor subpopulations may become much easier to identify than they have been thus far," she continued.
Most sequencing errors have traditionally been attributed to polymerase chain reaction (PCR) mistakes, miscalls, or degraded samples, such as formalin-fixed paraffin-embedded, ancient, or circulating tumor DNA.
"A key source of poor sequencing data quality now appears to be damage during library preparation, even for fresh DNA samples. For the next wave of genomics that is targeting increasingly low frequency mutations, success will be determined not only by maximizing library yield, but also by ensuring data quality," said Tom Evans, Ph.D., the second co-corresponding author on the study and scientific director for New England Biolabs' DNA Enzymes Division.
This research was conducted as part of a broader effort to understand the potential sources of sequencing errors and how these affect the integrity of sequencing results. Understanding these errors and their sources will permit the development of repair and workflow solutions to improve sequencing accuracy.
In the meantime, potentially problematic sequencing runs can be flagged during routine quality control using Ettwiller's open-source algorithm for estimating damage during library preparation.
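As a rough illustration of what such a quality-control flag could look like, the Python sketch below compares how often G-to-T mismatches appear in the first read of each pair versus the second. Damage introduced during library preparation is expected to skew that ratio away from 1, whereas true variants should appear equally in both reads. This is a minimal sketch under stated assumptions, not the published tool: the `RunCounts` record, the function names, and the 1.5 cutoff are all hypothetical choices made for the example, and the counts are assumed to have been tallied elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RunCounts:
    """Hypothetical per-run tallies; not the format used by the published tool."""
    run_id: str
    r1_variants: int  # G-to-T mismatches observed in first-in-pair reads
    r1_bases: int     # reference G positions covered by first-in-pair reads
    r2_variants: int  # G-to-T mismatches observed in second-in-pair reads
    r2_bases: int     # reference G positions covered by second-in-pair reads


def imbalance_ratio(c: RunCounts) -> float:
    """Ratio of the read 1 mismatch frequency to the read 2 mismatch frequency."""
    if c.r1_bases <= 0 or c.r2_bases <= 0:
        raise ValueError(f"run {c.run_id}: covered base counts must be positive")
    if c.r2_variants == 0:
        # No mismatches in read 2: balanced if read 1 has none either.
        return 1.0 if c.r1_variants == 0 else float("inf")
    # Cross-multiplied form of (r1_variants / r1_bases) / (r2_variants / r2_bases).
    return (c.r1_variants * c.r2_bases) / (c.r2_variants * c.r1_bases)


def excess_fraction(ratio: float) -> float:
    """Share of read 1 mismatches in excess of the read 2 baseline."""
    if ratio <= 1.0:
        return 0.0
    return 1.0 - 1.0 / ratio


def flag_runs(runs: Iterable[RunCounts], threshold: float = 1.5) -> List[str]:
    """Return IDs of runs whose imbalance ratio meets the assumed threshold."""
    return [r.run_id for r in runs if imbalance_ratio(r) >= threshold]


runs = [
    RunCounts("balanced", 100, 1_000_000, 100, 1_000_000),
    RunCounts("skewed", 200, 1_000_000, 100, 1_000_000),
]
flagged = flag_runs(runs)
```

Under these assumptions, a ratio of 1.5 means that about one-third of the read 1 mismatches exceed the read 2 baseline, since 1 - 1/1.5 = 1/3.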
Chen, L., Liu, P., Evans, T. C., & Ettwiller, L. M. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science. 17 Feb 2017: Vol. 355, Issue 6326. DOI: 10.1126/science.aai8690
To view the original version on PR Newswire, visit: http://www.prnewswire.com/news-releases/new-publication-highlights-dna-damage-as-a-prevalent-source-of-errors-in-public-cancer-databases-300409549.html
SOURCE New England Biolabs