Next-generation sequencing technology has advanced by leaps and bounds in recent years, improving read length, speed, and accuracy. Despite these advances, de novo assembly still faces challenges, especially for human and cancer genomes. Resolving complexities such as repetitive regions and gene duplications is crucial to building an accurate picture of the genome. To this end, innovators like Dr. Daniel Zerbino of the University of California, Santa Cruz, are developing bioinformatics methods to streamline data analysis and provide a comprehensive look at complex genomes.
Zerbino, of Velvet fame, compares de novo assembly to shredding the novel Moby Dick and piecing it back together—placing whole-page excerpts is easy, but if a fragment is a single word like “the,” it is hard to tell where that word belongs or how many times it appears in the novel. Similarly, repetitive regions and multiple gene copies complicate de novo assembly, and long reads are needed to place sequences accurately in the genome. Current methods simplify the problem by providing a compressed representation of the genome, in which a repeated sequence is recorded only once. The limitation when “detangling” these regions is that a single repeat can have many possible extensions, so the analysis can lose track of which extension belongs where.
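As a rough illustration of why repeats are ambiguous (a toy sketch, not Zerbino's actual software), consider the graph-based compression used by assemblers in the Velvet tradition: overlapping subsequences (k-mers) become graph edges, a repeated k-mer is stored only once, and the trouble appears when that single stored copy has more than one possible extension.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Build a minimal de Bruijn-style graph: nodes are (k-1)-mers,
    and each k-mer in a read adds an edge between its prefix and suffix.
    A k-mer seen many times is still stored only once."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Three toy reads sharing the repeat "GCGT":
reads = ["ATGGCGT", "GCGTGCA", "GCGTTTA"]
graph = build_graph(reads, 4)

# The repeat node "CGT" has two outgoing extensions ("GTG" and "GTT"),
# so the true path through the genome is no longer uniquely determined:
print(sorted(graph["CGT"]))
```

The ambiguity is exactly the article's point: the compressed graph keeps the data small, but every repeat node with multiple extensions is a fork where a short-read assembly can take the wrong branch.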
The problem becomes even more complex with cancer genomes, and ploidy becomes crucial—if a gene has two mutations, are these mutations on the same copy or on unrelated copies of the gene? To complicate matters further, cancer tissues are almost always heterogeneous. Samples contain a mixture of diverse cancer cells, DNA contamination from healthy cells, and material from immune cells reacting to the tumor. How do genomics analysis methods distill all that extra “noise” into an accurate reflection of the cancer genome?
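One way to see how mixing dilutes the signal (a simplified toy model, not the method described in the article) is to compute the expected variant allele fraction of a somatic mutation when tumor cells make up only part of the sample: normal cells contribute reference alleles, so the mutation's apparent frequency drops with sample purity.

```python
def expected_vaf(purity, tumor_copies_mut, tumor_copies_total, normal_copies=2):
    """Expected variant allele fraction of a somatic mutation in a mixed sample.

    purity             -- fraction of cells in the sample that are tumor cells
    tumor_copies_mut   -- copies of the mutated allele per tumor cell
    tumor_copies_total -- total copies of the locus per tumor cell
    normal_copies      -- copies per contaminating normal cell (diploid by default)
    """
    mutant_alleles = purity * tumor_copies_mut
    total_alleles = purity * tumor_copies_total + (1 - purity) * normal_copies
    return mutant_alleles / total_alleles

# A heterozygous mutation in a diploid tumor, in a sample that is 60% tumor:
# 0.6 * 1 / (0.6 * 2 + 0.4 * 2) = 0.30, versus 0.50 in a perfectly pure sample.
print(expected_vaf(0.6, 1, 2))
print(expected_vaf(1.0, 1, 2))
```

Inverting relationships like this — working backward from observed allele fractions to purity, ploidy, and subclone structure — is the kind of deconvolution that heterogeneity-aware analysis must perform.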
To meet these challenges, Zerbino and colleagues developed a mathematical framework that resolves these confounding signals and expresses them as a history, telling biologists whether multiple complex changes in the genome resulted from a single event. The analysis pipeline annotates deletions, breakends (pieces of the genome that are distant from each other in the reference but adjacent in the patient’s genome), and complex rearrangements. According to Zerbino, the hope is that scientists will be able to infer the state of the early cancer genome as it began to expand, distinguishing early mutations that were potential drivers of tumorigenesis from those that occurred later.
The pipeline also addresses the longstanding problem of heterogeneity in cancer samples. Until now, analysis methods required pure samples of cancer tissue. However, because a tumor is a population of cells, information about the surrounding environment creates a “cancer metagenome.” The pipeline leverages the noise of heterogeneous cancer samples—once an obstacle to cancer analysis—to extract additional information about the cancer environment. Zerbino’s method has significant implications for metagenome assembly as well, with the potential to effectively study diverse communities such as those of the human microbiome.
Dr. Zerbino anticipates that ideally, sequencing and analysis will identify oncogenes that are amplified in a patient, and physicians will receive a list of affected genes to inform treatment. Although there’s more work to be done before this method can be put into clinical practice, this new analysis pipeline promises to shed light on the structure and evolutionary history of the cancer genome by converting DNA reads to biological events.