Blog @ Illumina
Real scientists. Real commentary.

The Power and Promise of Population-Scale Genomics

Sean Humphray, Director, Scientific Research at Illumina
Jan 14, 2014

Whole-genome sequencing is the most comprehensive approach to cataloging an individual’s genetic constitution, capturing all of the variants present, including those in important noncoding regions, in a single assay. As a result, cost-effective, highly accurate human whole-genome sequencing is poised to become a mainstay of many medical applications.

Genetic variation influences almost all human diseases. Germline (inherited) variants cause or contribute to genetic disease, or predispose to rare and common diseases. Somatic (acquired) variants are an additional cause of the onset and evolution of cancer. Three examples, among many, of areas advanced by accurate, high-throughput sequencing are monogenic conditions, complex diseases (notably cancer), and pharmacogenetics.

More than 3,500 monogenic diseases have been characterized. They are frequent causes of neonatal morbidity and mortality, with presentations that are often undifferentiated at birth. Whole-genome and whole-exome sequencing have been used to identify rare, high-penetrance disease variants in monogenic conditions, and in examples of previously undiagnosed paediatric conditions [1-3].

Genetic variation can play many roles in common, complex diseases such as cancer, heart disease, and diabetes. Identifying genetic risk factors in individuals can help attribute risk, indicate follow-up tests or lifestyle changes, or form the basis for further research to assess the true predictive value of such variants.

Genome sequence information can pinpoint pharmacogenetically important variants that influence drug response. Such variants can assist in managing drug dosage, efficacy, and adverse effect risk. Genomic data can also be used to distinguish forms of bio-identity, including blood groups, tissue typing, or HLA, with implications for lifetime use in transfusion, transplantation, and immune defects.

From Sample to Answer – Making Genomes Really Useful

Much attention has been focused on recent, rapid developments in next-generation sequencing (NGS) technology that have provided dramatic, log-scale improvements in throughput and cost. However, as whole-genome sequencing scales to the population level, the speed and economy of data processing become the most important drivers for development. HiSeq X Ten, a set of 10 HiSeq X Systems, opens the door to true population-level genomics by generating up to 18,000 whole human genomes per year. But as impressive as reshaping the economics and scale of human genome sequencing may be, it is only part of the story; making the data accessible and decipherable so as to be truly useful is the rest.

Improvements to NGS have been made alongside increasingly efficient approaches to data analysis. Comprehensive software tools such as BWA+GATK and Isaac for fast alignment and variant calling are continually being improved for greater accuracy and efficiency. The Isaac Genome Alignment and Isaac Variant Caller together form a fast and accurate whole-genome resequencing workflow that begins with BCL or FASTQ files and produces BAM and VCF files in just over 7 hours for a 30× genome on commodity computing hardware.

Sustaining population-scale genomics also requires keeping storage requirements to a minimum. An assembled 30× genome is currently stored in a BAM file of ~100 Gb. The efforts of many researchers and bioinformaticians over the past two years have achieved a 30% reduction in this figure. For example, whole-genome data quality scores (Q-values) can now be merged from 40 separate bins down to 8 without sacrificing either score accuracy or performance. Other community efforts to reduce the data footprint further (such as CRAM, with tunable, reference-based compression) are underway.
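To make the idea of Q-value binning concrete, here is a minimal Python sketch. It is an illustration only: the bin boundaries and the representative score stored for each bin are hypothetical choices made for this example, not the scheme used in any particular pipeline, and the quality string is synthetic rather than real sequencing data. The point is simply that collapsing roughly 40 distinct scores into 8 leaves fewer distinct symbols, which a general-purpose compressor (zlib here, used as a rough proxy) can store more compactly.

```python
import random
import zlib
from bisect import bisect_right

# Hypothetical 8-bin scheme: (lowest Phred score in the bin, score stored for the bin).
# These boundaries are illustrative assumptions, not values given in this post.
HYPOTHETICAL_BINS = [
    (0, 0), (2, 6), (10, 15), (20, 22),
    (25, 27), (30, 33), (35, 37), (40, 40),
]
_LOWER_BOUNDS = [low for low, _ in HYPOTHETICAL_BINS]


def bin_phred(q: int) -> int:
    """Map one Phred quality score to the representative score of its bin."""
    if q < 0:
        raise ValueError("Phred scores are non-negative")
    return HYPOTHETICAL_BINS[bisect_right(_LOWER_BOUNDS, q) - 1][1]


def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Re-encode a FASTQ quality string (Phred+33 by default) using binned scores."""
    return "".join(chr(bin_phred(ord(c) - offset) + offset) for c in qual)


def compressed_size(text: str) -> int:
    """Size in bytes after zlib compression, a rough proxy for on-disk footprint."""
    return len(zlib.compress(text.encode("ascii")))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic quality string with scores drawn uniformly from Q0..Q41 (not real data).
    raw = "".join(chr(rng.randint(0, 41) + 33) for _ in range(100_000))
    binned = bin_quality_string(raw)
    print("distinct scores:", len(set(raw)), "->", len(set(binned)))
    print("compressed bytes:", compressed_size(raw), "->", compressed_size(binned))
```

Real quality strings are far from uniformly distributed, so the saving on actual data will differ from what this toy input shows. Whether a given binning preserves variant-calling accuracy has to be checked empirically against unbinned data.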

Depending on the sequencing application, the important dataset is usually not the genome assembly itself but the consensus sequence and variants that come from the assembled genome. These data are derived from the BAM file and can be collected into genome variant call format (gVCF) files that occupy only a fraction of the space of the BAM file (~2 Gb). The utility of genomic data is realized through annotation, leveraging public databases (e.g., Ensembl, GenBank, 1000 Genomes, ENCODE) and using public tools such as the Ensembl Variant Effect Predictor. Annotations can be associated with an individual gVCF and add very little to the data storage requirement but a great deal to the biological relevance.
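A quick back-of-the-envelope calculation shows why the choice of file format matters at population scale. The short Python sketch below uses only the figures quoted in this post (18,000 genomes per year, ~100 per BAM, ~2 per gVCF) and rests on several simplifying assumptions: that the unit written as "Gb" means gigabytes, that every genome has exactly the quoted size, and that the instrument set runs at its stated maximum for a full year. The function and variable names are made up for the illustration.

```python
# Figures quoted in the post; sizes are kept in the post's own unit ("Gb"),
# which is assumed here to mean gigabytes.
GENOMES_PER_YEAR = 18_000   # stated maximum for one HiSeq X Ten
BAM_SIZE = 100              # ~100 per 30x genome
GVCF_SIZE = 2               # ~2 per genome


def yearly_footprint(size_per_genome: float, genomes: int = GENOMES_PER_YEAR) -> float:
    """Total storage for one year of output, assuming every genome has the same size."""
    return size_per_genome * genomes


bam_total = yearly_footprint(BAM_SIZE)    # 1,800,000 (about 1.8 petabytes if the unit is gigabytes)
gvcf_total = yearly_footprint(GVCF_SIZE)  # 36,000 (about 36 terabytes under the same assumption)
print(f"BAM: {bam_total:,}  gVCF: {gvcf_total:,}  ratio: {bam_total / gvcf_total:.0f}x")
```

Under these assumptions, keeping only gVCF files for a year of output needs roughly one-fiftieth of the space that the corresponding BAM files would. This is only a rough guide, since real file sizes vary with coverage, read length, and compression settings.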

Sequencing Populations

While population groups share many common variants, each population group has also accumulated its own set of variants. Events such as mutation, random drift, and natural selection generate diversity, and a subset of variants naturally affects gene function. If the resulting effect is potentially or actually deleterious to the health of an individual, these variants can either directly cause disease or contribute to increased risk. Having tens of thousands of genomes, made possible by the so-called “factory-scale” sequencing technology of HiSeq X Ten, will revolutionize the study of population diversity and can help us understand the genetic basis of this risk. Implementing sequencing programs on a national scale will speed the integration of genomic information into medicine and improve the standard of care, with the potential to bring about major savings in healthcare economics. Aggregation of individual genome sequences with clinical and other phenotypic information will empower both researchers and clinicians as they seek to bring to fruition the potential of precision medicine.

Aggregating All the Information

An enormous benefit of introducing large-scale sequencing into clinical research, and ultimately healthcare, is bringing genomic and phenotypic information together and using it to find new associations, advance our understanding of the functional consequences of DNA mutations, and improve our ability to diagnose and predict the outcome of disease for individual patients. As datasets expand, it becomes possible to make use of the aggregated genomic and patient outcome datasets, for example through the NextBio approach. This use of big data technology will enable users to systematically integrate and interpret public and proprietary molecular data and clinical information from individual patients, population studies, and model organisms, thus applying genomic data in novel and useful ways, both in research and in the clinic.


1. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, et al. (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461(7261):272–276.

2. Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, et al. (2011) Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 13(3):255–262.

3. Saunders CJ, Miller NA, Soden SE, Dinwiddie DL, Noll A, et al. (2012) Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci Transl Med 4(154):154ra135.