New Publication Highlights DNA Damage as a Prevalent Source of Errors in Public Cancer Databases

Addressing this DNA damage could improve the detection of low-frequency disease variants, New England Biolabs® researchers report today in Science

News provided by New England Biolabs, Feb 17, 2017, 10:27 ET

IPSWICH, Mass., Feb. 17, 2017 /PRNewswire/ -- If only one in every hundred cells in a small tumor had a mutation conferring resistance to chemotherapy, thousands of cancer cells would remain after treatment. Recently, genomics research has focused on finding these mutations in ever-smaller subpopulations of tumor cells, or even in circulating tumor DNA in blood, to guide treatment decisions and potentially identify new oncology therapeutics.

Unfortunately, detecting these genetic anomalies may be more difficult than anticipated. Routine protocols for preparing DNA for sequencing appear, unexpectedly, to induce damage in otherwise fresh, high-quality DNA samples. Because of their pattern and frequency, the resulting artifacts confound the detection of the "true" low-frequency mutations that are critical to understanding the genetic diversity of tumors and the drivers of recurrence in cancer patients.

A new algorithm, developed and tested by researchers at New England Biolabs (NEB®) and reported today in Science, suggests this problem is widespread among DNA sequencing studies. The researchers estimated that 41 percent of the sequencing runs from the 1000 Genomes Project and 73 percent of the runs addressed in The Cancer Genome Atlas show signs of DNA damage from library preparation. More specifically, one-third or more of G-to-T mutations in reads may be false positives. Other variant calls are affected too, but less often.

"The extent of the damage in public databases surprised us. While this is very unlikely to affect the conclusions from those initiatives at the population level, these artifacts may limit the ability to assess tumor heterogeneity and the drivers of relapse in individual cancer patients. Understanding this damage offers the research community an opportunity to enhance diagnostics and accelerate cancer research," said Laurence Ettwiller, Ph.D., a research scientist at New England Biolabs and co-corresponding author on the study.

"The potential upside is that by removing these artifacts, we have the opportunity to lower the threshold for detecting bona fide somatic variants. Driver mutations among small tumor subpopulations may become much easier to identify than they have been thus far," she continued.

Most sequencing errors have traditionally been attributed to polymerase chain reaction (PCR) mistakes, miscalls, or degraded samples, such as formalin-fixed paraffin-embedded, ancient, or circulating tumor DNA.

"A key source of poor sequencing data quality now appears to be damage during library preparation, even for fresh DNA samples. For the next wave of genomics that is targeting increasingly low-frequency mutations, success will be determined not only by maximizing library yield, but also by ensuring data quality," said Tom Evans, Ph.D., the second co-corresponding author on the study and scientific director for New England Biolabs' DNA Enzymes Division.

This research was conducted as part of a broader effort to understand the potential sources of sequencing errors and how these affect the integrity of sequencing results. Understanding these errors and their sources will permit the development of repair and workflow solutions to improve sequencing accuracy.

In the meantime, potentially problematic sequencing runs can be flagged during routine quality control using Ettwiller's open-source algorithm for estimating damage during library preparation.

Chen, L., Liu, P., Evans, T. C., & Ettwiller, L. M. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science. 17 Feb 2017: Vol. 355, Issue 6326. DOI: 10.1126/science.aai8690

SOURCE New England Biolabs