Blog @ Illumina
Real scientists. Real commentary.

MutSigCV: The Magnet That Helps Pull That Needle Out of the Haystack

Maria Celeste Ramirez, Ph.D. | Aug 07, 2013

Next-generation sequencing (NGS) has fast become a mainstream tool in cancer research, but its advantage is also its pitfall. The real challenge in cancer genomics is having the sensitivity required to identify low-frequency variants within a highly heterogeneous sample. In this respect, next-generation sequencing is unparalleled. The downside of such great sensitivity, however, is having to plow through all the information that is gathered. Tumor profiling with NGS identifies hundreds of thousands of mutations for every sample sequenced. As with any study of complex multifactorial diseases, the strength of the study is in the numbers, so large sample sizes are required to lend significance to the findings. The result is the challenge of finding biological significance for mutations on the order of millions, which creates the informatics bottleneck familiar to everyone doing NGS-based cancer genomics.

In the paper that introduces MutSigCV, the authors observed that a good number of genes implicated in cancer studies have no obvious involvement in disease pathogenesis based on their biological function or properties, suggesting a high number of previously unrecognized false-positive associations. This led them to hypothesize that mutational heterogeneity contributes significantly to the increased background noise within these datasets. They set out to test this by interrogating samples from a cohort of more than 3,000 patients, analyzing tumor-normal pairs across 27 tumor types to better understand the natural processes that lead to mutation outside the disease context. In doing so, they aimed to enhance the true signal from pathogenic mutations by minimizing the signal from variants unrelated to disease.

Intratumor heterogeneity has been well described in the cancer literature and has been the main driver for developing tools and protocols that enable increased sensitivity. In addition, the ability to confidently call low-frequency variants within a mixed pool has been the Holy Grail for those who develop variant-calling algorithms. On top of this, each cancer patient has followed their own path to disease, shaped by their genetic predisposition and by the environmental factors to which they have been exposed. This greatly expands the set of mutations and genic regions that could potentially be implicated in disease pathogenesis. The problem then becomes: in this large collection of suspect genes, how do you tease out which mutations are drivers, which are passengers, and which are just naturally hypervariable and have no impact on the onset or progression of disease? In other words, how do we find that needle in the haystack of noise?

As they delved deeper into the story, the authors observed key themes that supported their hypothesis. First, the median frequency of non-synonymous mutations was highly variable across cancer types, with a range spanning three orders of magnitude; mutation frequencies also varied widely among patients with the same cancer type. Second, the nature of the nucleotide changes varied greatly as well and, interestingly, the distribution and frequency of mutations in different cancer types reflected their natural groupings based on known predisposing factors. Third, certain genomic factors, such as gene expression level and replication timing, were highly correlated with mutation frequency within a tumor and across tumors, even those of different cancer types.
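To make the first observation concrete, here is a minimal Python sketch of how one might measure that spread. The sample tuples, the cancer-type labels, and the helper names are all hypothetical toy values chosen for illustration; none of them come from the study.

```python
import math
from statistics import median

# Hypothetical toy data, not values from the paper:
# (cancer_type, non-synonymous mutation count, megabases covered)
samples = [
    ("type_A", 3, 30.0),
    ("type_A", 6, 30.0),
    ("type_A", 30, 30.0),
    ("type_B", 3000, 30.0),
    ("type_B", 6000, 30.0),
    ("type_B", 30000, 30.0),
]

def rates_by_type(samples):
    """Group per-sample mutation rates (mutations per Mb) by cancer type."""
    grouped = {}
    for cancer_type, n_mutations, megabases in samples:
        grouped.setdefault(cancer_type, []).append(n_mutations / megabases)
    return grouped

def orders_of_magnitude(values):
    """log10 of the ratio between the largest and smallest value."""
    return math.log10(max(values) / min(values))

grouped = rates_by_type(samples)

# Median mutation rate for each cancer type
medians = {t: median(rates) for t, rates in grouped.items()}

# Spread of the medians across cancer types
spread_across_types = orders_of_magnitude(list(medians.values()))

# Spread among patients who share a cancer type
spread_within_types = {t: orders_of_magnitude(r) for t, r in grouped.items()}
```

With real data, the same two numbers (spread across types and spread within a type) would show how far a single, global background mutation rate is from reality.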

With an understanding of what constitutes “background” events, the authors developed MutSigCV, an integrated approach for identifying mutated genes that are highly associated with disease, not just genes that are highly mutated in disease samples. The approach takes into account the processes they observed that are unrelated to disease yet still lead to nucleotide changes. By subtracting these events from the full list of potential disease-related variants, they obtain a more defined list of genes with real associations to the phenotype.
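The idea of correcting for background can be illustrated with a deliberately simplified Python sketch. This is not the MutSigCV implementation. It assumes that genes with similar expression and replication timing share a background rate, that silent mutations stand in for background events, that silent and non-silent sites mutate at the same per-base rate, and that a Poisson tail is an adequate significance test. The gene names, covariate values, counts, and function names are all hypothetical.

```python
import math

N_SAMPLES = 100  # hypothetical cohort size

# Hypothetical toy genes. "expr" and "rep" are scaled covariates
# (expression level, replication timing); "silent" counts stand in for
# background events; "nonsilent" counts are the candidate signal.
genes = {
    "GENE_A": {"expr": 0.90, "rep": 0.10, "silent": 1, "nonsilent": 10, "bases": 2000},
    "GENE_B": {"expr": 0.80, "rep": 0.20, "silent": 1, "nonsilent": 1, "bases": 2000},
    "GENE_C": {"expr": 0.85, "rep": 0.15, "silent": 1, "nonsilent": 2, "bases": 2000},
    "GENE_X": {"expr": 0.10, "rep": 0.90, "silent": 12, "nonsilent": 14, "bases": 2000},
    "GENE_Y": {"expr": 0.15, "rep": 0.85, "silent": 10, "nonsilent": 11, "bases": 2000},
    "GENE_Z": {"expr": 0.05, "rep": 0.95, "silent": 14, "nonsilent": 12, "bases": 2000},
}

def nearest_genes(name, genes, k):
    """Return the k genes closest to `name` in (expr, rep) covariate space."""
    target = genes[name]
    others = [g for g in genes if g != name]
    others.sort(key=lambda g: math.hypot(genes[g]["expr"] - target["expr"],
                                         genes[g]["rep"] - target["rep"]))
    return others[:k]

def background_rate(name, genes, n_samples, k=2):
    """Per-base, per-sample background rate pooled over similar genes."""
    pool = [name] + nearest_genes(name, genes, k)
    silent = sum(genes[g]["silent"] for g in pool)
    opportunities = sum(genes[g]["bases"] for g in pool) * n_samples
    return silent / opportunities

def poisson_tail(k_obs, lam):
    """P(X >= k_obs) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k_obs))
    return max(0.0, 1.0 - below)

def gene_p_value(name, genes, n_samples, k=2):
    """Chance of seeing at least the observed non-silent count from background alone."""
    expected = background_rate(name, genes, n_samples, k) * genes[name]["bases"] * n_samples
    return poisson_tail(genes[name]["nonsilent"], expected)

p_values = {g: gene_p_value(g, genes, N_SAMPLES) for g in genes}
```

In this toy set, GENE_X carries more non-silent mutations than GENE_A, yet it sits among genes with a high background rate, so its count is unremarkable; GENE_A has fewer mutations but far more than its quiet neighborhood predicts. That is the distinction between "highly mutated" and "highly associated with disease" that the post describes.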

This study is extremely valuable in extending our understanding of the mechanisms that produce mutations in both the normal and diseased states. By integrating this understanding into the identification of genes implicated in cancer (and drastically decreasing the number of rows in the Excel spreadsheets that result from these studies), the authors have reduced the computational and analytical effort needed to validate these findings, thereby helping accelerate cancer research. To Dr. Getz and the rest of the group: our laptops (with limited RAM) thank you!
