Blog @ Illumina
Real scientists. Real commentary.

ESHG 2015 - Day Two Summary

by
Scott Brouilette, Ph.D.
| Jun 08, 2015

The organisers clearly decided that the best way to ensure everyone was up and at the 8:30 sessions on a Sunday morning was to roll out the big guns - and so the Big Data Genomics sessions got underway with the one-and-only Dan MacArthur. Dan introduced us to the Exome Aggregation Consortium project, ExAC (for Dan’s overview of ExAC please see this link). The motivation for the project is simple: making sense of one genome requires the study of tens of thousands of genomes. To date, more than 500,000 exomes and 50,000 genomes have been sequenced, but the data are often disparate, having been generated using different informatics pipelines, and are frequently siloed. This becomes a particular problem when attempting to bring multiple genomic datasets together in a meta-analysis. The solution is relatively simple: rather than collecting the aligned data or variant call files from multiple projects, go back to the FASTQ files and reanalyse everything through a single pipeline. Such an approach requires considerable storage and compute power, but that is precisely what the ExAC project has done: 92,000 exomes from projects all around the world, re-aligned and with variants called using the GATK 3 HaplotypeCaller pipeline. Dan proudly stated that the size and diversity of ExAC are an order of magnitude greater than anything else currently available. Analysis of this dataset has yielded 10 million variants (equating to one variant for every 6 base pairs), and 200,000 candidate loss-of-function alleles affecting more than 15,000 protein-coding genes. Importantly, the size and ancestral diversity of the project have significantly increased the power to detect rare disease variants. All data are publicly available at the ExAC site, which has received an impressive 1.1 million page views in just the last few months. The project will move forward by adding more exomes, moving to whole genomes, and creating more user-friendly tools to query this massive dataset.
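
As a quick sanity check on the figures quoted in the talk, 10 million variants at one variant per 6 base pairs implies roughly 60 Mb of sequence. A minimal Python sketch of that arithmetic follows; the function name is made up for illustration, and the inputs are the rounded numbers reported above rather than exact values from the ExAC release.

```python
def implied_target_size_bp(n_variants: int, bp_per_variant: float) -> float:
    """Span of sequence implied by a variant count and an average spacing."""
    return n_variants * bp_per_variant


# Rounded figures as quoted in the talk summary (assumed, not exact).
size_bp = implied_target_size_bp(10_000_000, 6)
print(f"Implied target: {size_bp / 1e6:.0f} Mb")  # prints "Implied target: 60 Mb"
```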

So from one big project to another: Patrick Sulem from deCODE relayed findings from their recent, very high-profile publication. Iceland has long been an almost perfect place to study human genetics owing to its relatively small number of inhabitants (320,000), very detailed genealogical records, and high-quality universal health care. The nature of the cohort enables long-range haplotyping, and imputation allows variants with a minor allele frequency (MAF) of 0.1% to be detected. Patrick defined this as rare, leading some on Twitter to ask for a consensus definition of a rare variant. He then recapped a few of the key highlights from the aforementioned publication; for example, 1 in 13 subjects is a rare, complete human knockout, with 15% of those individuals being compound heterozygotes.
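
For readers less familiar with the term, minor allele frequency is simply the frequency of the less common allele among all chromosomes sampled. Here is a minimal sketch, assuming a biallelic site and diploid genotype counts; the function name and the example counts are illustrative and are not taken from the deCODE data.

```python
def minor_allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """MAF at a biallelic site, computed from diploid genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    # Each heterozygote carries one alternate allele, each alt homozygote two.
    alt_freq = (n_het + 2 * n_hom_alt) / total_alleles
    # The "minor" allele is whichever of the two is less common.
    return min(alt_freq, 1 - alt_freq)


# Hypothetical cohort: 9,980 reference homozygotes and 20 heterozygotes.
maf = minor_allele_frequency(9_980, 20, 0)
print(f"MAF = {maf:.2%}")  # prints "MAF = 0.10%"
```

In this made-up cohort of 10,000 people, a variant at the 0.1% threshold is carried by only 20 individuals, which gives a feel for why large cohorts and imputation matter for rare variants.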

In the final session, Aarno Palotie discussed another population which, like Iceland, benefits from having had a small number of founders coupled with limited immigration. In 1670 there were just 40 families in Finland, which increased to 18,200 by 1995. The data from this project forms part of the SiSu project.

While many researchers have their favourite “-ome” to study, increasingly the integration of “omics” is required to elucidate the mechanisms underlying disease. Arthur Ko (UCLA) started by discussing obesity, a condition with 40-80% heritability but in which GWAS explain less than 3%. So what else contributes to the condition? Many of the GWAS-identified variants to date appear to have a regulatory function, so Arthur and his team are working on the hypothesis that these variants are perhaps only activated in a specific context. They used the Finnish METSIM cohort to look for context-specific expression quantitative trait loci (eQTLs) using RNA-Seq data from 566 subjects. Several examples relating to obesity were identified and replicated in both blood and adipose tissue; expression of the top 65 eQTL genes explained around 6% of the variance across all metabolic traits, mediated by or targeting oestrogen pathways. DV Zhernakova from the BIOS Consortium continued the theme of “contextual expression”, this time in a Dutch cohort of 2166 samples. The data, subsequently replicated in Geuvadis, revealed that 80% of expressed protein-coding genes are cis-regulated, with multiple SNPs often affecting the same gene and many of the eQTLs being in LD with previous GWAS hits. Furthermore, when the data are separated into individual cell types, there are clear cell-specific effects, indicating that genetic variation in gene expression depends on contexts such as cell type or stimulation.
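
To make the idea of an eQTL and of "variance explained" concrete, the sketch below regresses a gene's expression on allele dosage (0, 1 or 2 copies of an allele) and reports the slope and R². This is a toy illustration only: the function name and the numbers are invented, and it is not the method used in either talk, where real analyses would also handle covariates, normalisation, context interactions and multiple testing.

```python
def dosage_regression(dosages, expression):
    """Least-squares fit of expression on allele dosage; returns (slope, r2)."""
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need matched lists with at least three samples")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx                # expression change per extra allele copy
    r2 = sxy ** 2 / (sxx * syy)      # fraction of expression variance explained
    return slope, r2


# Made-up values for six samples, purely for illustration.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, r2 = dosage_regression(dosages, expression)
print(f"effect per allele = {slope:.2f}, variance explained = {r2:.1%}")
```

A context-specific eQTL, in this framing, would be one where the slope differs between contexts such as cell type or stimulation; testing that would need an interaction term, which is left out here for brevity.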

The final talk of the session from Alexandra Zhernakova (Groningen) was on “integrated” human and bacterial genomes in relation to BMI and blood lipid profiles. Alexandra reminded us that the number of bacterial cells in the human body is 10 times greater than the number of human cells before asking which intestinal bacteria impact human lipid metabolic profiles, and whether host genetics have any role in the process. Using LifeLines Deep, a unique cohort with multiple phenotypic measurements, the group used 16S rRNA profiling to show that both age and gender have a strong effect on microbiota composition and diversity. But after adjusting for these, 34 taxa were associated with BMI, TG, and HDL, while one taxon was associated with LDL. There was no significant correlation between lipid and BMI SNPs and the microbiome, but the microbiome does explain up to 6% of the blood lipid level variance.

Data sharing had been touched on in day one by Matt Hurles, so it was unsurprising to see that the “Data Sharing Initiatives” session proved very popular, with standing room only despite the considerable room capacity. This perhaps suggests that the issue is uppermost in the minds of many groups around the world as the amount of genomic data available continues to increase. Helen Firth (Sanger) explained how DECIPHER is helping researchers and clinicians around the world. DECIPHER was set up in 2004 and has resulted in over 700 publications since 2011. To date, 17,000 anonymous records have been shared publicly and more than 18,000 shared with consortia, involving 43 countries, 25+ projects and 1400+ registered users. DECIPHER’s strength is that it is comprehensive (SNVs, CNVs, nuclear and mitochondrial) and dynamic, offering a “real-time” representation of the genome. Helen made the point that data-sharing databases have to be dynamic, particularly in a clinical setting, to avoid fossilising the interpretation of a gene to the date it was deposited. DECIPHER can be queried by phenotype or gene name (including novel genes), with no login required. A question from the session chair was “What is the major bottleneck in relation to sharing?” It seems that in the research field the main fear is of losing “exclusivity” and thus the ability to publish. Helen stated that she knows of many groups that found the opposite: by sharing data and taking a more collaborative approach, they were able to publish more impactful papers than those with isolated data.
