Blog @ Illumina
Real scientists. Real commentary.

Friday at AGBT Re-Cap

Abizar Lakdawalla, Ph.D.
| Feb 17, 2012

Wow! What an amazing day! Incredible science was on show, from the ENCODE project to the much-anticipated announcement from Oxford Nanopore Technologies. Let me get right to it with brief recaps of the talks (in chronological order)…

Rick Myers of the HudsonAlpha Institute led the charge, promising us 40 or so papers from the ENCODE project in 2012. He started off by recalling his collaboration with Barbara Wold to study protein-DNA interactions, with a story about NRSF, a repressor that binds to 2,000 sites in the genome. Rick's lab has used ChIP-Seq on over 80 transcription factors in multiple cell lines as part of the ENCODE effort. The results indicate that occupancy of proteins at a specific DNA site is dependent on the allele. For example, GABP binds only if the G allele is present. He then moved on to RNA-seq, describing it as the optimum method for studying gene expression because of its high sensitivity (40 million reads give 40 reads for a 1 kb mRNA present at as little as 1 copy per cell). They have developed a new method for preparing RNA libraries based on transposon hopping (Tn-RNA-Seq) that is easier and requires only 10 pg of mRNA (1 ng of total RNA). Integrating ChIP-Seq, RNA-Seq and DNA methylation data sheds more light on gene regulation. Rick's lab used reduced representation bisulfite sequencing (RRBS) and the Illumina 450K methylation array on DNA from 82 cell lines and tissues to discover the top 5% most variable CpGs. Genetics has a strong effect on DNA methylation: in families with whole genome sequencing data, 8% of the methylation sites seemed to show allele-specific imprinting, while 92% of the sites were defined by genetics. An example based on KT Varley's work on a breast cancer therapeutic, an anti-death receptor 5 antibody, showed a strong association between methylation of 114 CpGs and very striking differences in sensitivity and resistance to the antibody therapy. He then showed that the correlation between methylation and gene expression is highly dependent on the location of the CpGs. Rick set the stage for the next talk, which was a genuine eye-opener.
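
As a rough sanity check on that sensitivity figure, here is a back-of-the-envelope sketch in Python. The talk did not spell out the assumptions behind the number, so the values for mRNA molecules per cell (500,000) and mean mRNA length (2 kb) below are illustrative guesses of mine, chosen only because they are plausible and reproduce the quoted 40 reads. The function name `expected_reads` is likewise made up for this example.

```python
def expected_reads(total_reads, transcript_len_nt, copies_per_cell,
                   mrnas_per_cell, mean_mrna_len_nt):
    """Expected read count for one transcript.

    Assumes reads are sampled in proportion to each transcript's share
    of the total mRNA nucleotides in the cell (no bias, no rRNA).
    """
    transcript_nt = transcript_len_nt * copies_per_cell
    pool_nt = mrnas_per_cell * mean_mrna_len_nt
    return total_reads * transcript_nt / pool_nt


# 40 million reads, a 1 kb mRNA at 1 copy per cell.
# ASSUMPTION (not from the talk): 500,000 mRNAs per cell, 2 kb on average.
reads = expected_reads(40_000_000, 1_000, 1, 500_000, 2_000)
print(reads)  # 40.0
```

Under this simple model the expected count scales linearly with both transcript length and copy number, so a shorter or rarer transcript would need proportionally deeper sequencing.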

Tom Gingeras from Cold Spring Harbor started off with an incredible summary of the transcriptome as discovered by ENCODE and other projects. His presentation speed was approximately 28% slower than the speed of light, so these notes may have some errors. He reported that there are 51,082 genic regions with 161,375 transcripts. 20,684 are protein-coding genes. More than 76,000 non-coding transcripts have been reported. Most gene loci code for 8 transcripts on average, with more than half of the isoforms being non-coding. The number of genes has increased by 41% since 2004. The number of transcripts has increased 52% since 2008. I guess most of the textbooks will have to be rewritten. He then went even more granular, discussing the spatial localization of RNA within a cell: cytoplasmic RNA and nuclear RNA (nucleoli, nucleoplasm, chromatin). It could have been interesting if he had gone into RNA isolated from specific cell compartments, but instead he moved on to sequencing of long RNA (> 200 nt, 200 M reads x 2 replicates) and short RNA (< 200 nt). They used an IDR (irreproducible discovery rate) metric to define signal and background noise thresholds, especially at low expression levels. Sequencing of these cell lines resulted in the detection of 72% of all known splice junctions. The overall percentage of the genome covered by all the transcripts was 80% (no, that is not a typo). More than half of the 247,685 unannotated single-exon transcripts were intergenic or antisense. The intergenic regions contained about 50,000 transcripts. Novel multi-exon transcripts were seen. Because of the discovery of these transcripts, the sizes of intergenic regions have shrunk to < 10,000 bp. Many of these transcripts were validated by RT-PCR followed by 454 sequencing. An individual cell will express multiple isoforms, with some isoforms being dominant within that cell. The minor isoforms seemed to be limited to specific compartments of the cell. Many small RNAs map to the same regions as the long intergenic transcripts. And so on …

Geoff Smith from Illumina talked about sequencing in a clinical setting. First he brought us up to speed on the MiSeq (output increasing from 1.5 to 7 Gb, and read lengths of 2 x 250 bp!) and then showed the power of whole genome sequencing of bacteria by sequencing supposedly identical tuberculosis strains, clearly identifying mutations in five genes that made one strain multidrug-resistant and the other not. He then went on to talk about evaluating MRSA (methicillin-resistant Staphylococcus aureus), which caused 94,360 invasive infections in the US with 1800 deaths (I think these numbers were for 2005). The rest of the talk was embargoed due to pre-publication data, but the take-home was that sequencing data shows far greater discriminatory power than traditional tests.

Kevin Hrusovsky from Caliper presented a set of slides about 100-year-olds running marathons, obesity, corn-based foods, diabetes, metastatic melanoma drugs and lots of statistics about health, medicine, and energy. Other topics he covered in his wide-ranging talk: Geospiza, LabChip, robotics, companies that PerkinElmer and Caliper had acquired, biomarkers for animals and humans, CTC, qPCR, iHealth dashboard, and cancer patients including Steve Jobs. He ended with an invitation to join him for a run in the morning.

Carlos Bustamante from Stanford used data from ancient human genomes, Complete Genomics and the 1000 Genomes Project for population genetic inference. He provided statistics about Complete Genomics, which had completed only 3,800 genomes by the end of 2011 and has made 69 genomes available as a reference (with plans for 500 more genomes). With 3 errors in 1 Mb, CGI data compares favorably with other NGS systems. The CGI samples include American admixed genomes as well as other ethnicities, which makes them highly valuable for determining diversity. From the data, some populations exhibit 2x greater genetic diversity. Europeans show higher amino acid diversity, while Chinese have 50% more 'probably damaging homozygous' variation. Excess deleterious homozygous mutations may originate from one or more severe population bottlenecks followed by a population expansion. The rest of the presentation, on the analysis of Ötzi the Iceman's genome, was under embargo (I think, as my auditory capabilities were slightly compromised by this time).

In a highly anticipated talk, Clive Brown described the Oxford Nanopore Technologies platform. Essentially, protein nanopores reside in a membrane that is not a lipid bilayer, sitting on an ASIC that captures the change in current as the DNA passes through the pore with the help of an enzyme that is not a polymerase. Because multiple bases are sensed at the same time, the change in current over time is converted to a 3-base call. A Markov model base caller is required to resolve the 64 possible states, using a probabilistic path to derive the individual bases. Clive reported that they had sequenced the PhiX genome as a single fragment with 96% accuracy. The main error mode was deletions, with DNA wobble potentially being the major source of error. Residual errors may originate from the spread of triplet currents, so better pores will be needed (mutant pores, charge carriers). Base current also varies with modified bases, contributing to error. Clive mentioned that RNA can be read directly, though a different enzyme will be required to drive the RNA through the pore. (My take: it will be challenging not to clog the pore with an RNA knot, given how stable RNA tertiary structures are.) DNA modifications can be very sensitively detected, according to Clive, but no data was shown. Clive mentioned that sequencing can be done without sample prep; as an example, they applied blood from a friend's rabbit and water from a drain (?) directly onto the membrane. I wonder how they address the challenges of keeping the buffer composition stable enough to get a predictable current profile when biological materials are added, while also avoiding clogging of the protein nanopore and degradation of the DNA- or RNA-protein pore-enzyme complex. In addition to the GridION system described previously, Clive announced a new product, the MinION, a USB stick based disposable sequencer with preliminary specifications of 500 pores able to sequence 150 bases per second. They also expect the first GridION system to have 2,000 pores, with 250-300 bases per second from a node. A rack could hold up to 20 nodes. In response to a question on whether the USB would be compatible with a Mac or a PC, Clive stated that he is a Mac user.
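
To make the triplet idea a little more concrete, here is a toy sketch in Python of how a Markov model could turn a series of current readings into bases. This is emphatically not Oxford Nanopore's base caller. The current levels are invented (one evenly spaced level per triplet), the noise is assumed Gaussian, and the strand is assumed to advance exactly one base per reading, so the deletions Clive described as the main error mode are not modeled at all. All names (`TOY_LEVEL`, `viterbi`, `simulate`, and so on) are hypothetical.

```python
import itertools

BASES = "ACGT"
# The 64 possible triplet states: "AAA", "AAC", ..., "TTT".
TRIPLETS = ["".join(p) for p in itertools.product(BASES, repeat=3)]
# ASSUMPTION: made-up mean current per triplet, evenly spaced.
TOY_LEVEL = {t: float(i) for i, t in enumerate(TRIPLETS)}


def log_emission(observed, triplet, sigma=0.3):
    """Gaussian log-likelihood (up to a constant) around the toy level."""
    z = (observed - TOY_LEVEL[triplet]) / sigma
    return -0.5 * z * z


def viterbi(currents, sigma=0.3):
    """Most likely triplet path, assuming one base of movement per reading.

    Each triplet has exactly four possible predecessors, all treated as
    equally likely, so the constant transition term is left out.
    """
    score = {t: log_emission(currents[0], t, sigma) for t in TRIPLETS}
    back = []
    for obs in currents[1:]:
        new_score, pointers = {}, {}
        for t in TRIPLETS:
            # A predecessor's last two bases must equal t's first two.
            preds = [b + t[:2] for b in BASES]
            best = max(preds, key=lambda p: score[p])
            new_score[t] = score[best] + log_emission(obs, t, sigma)
            pointers[t] = best
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def path_to_bases(path):
    """Collapse overlapping triplets into a base sequence."""
    return path[0] + "".join(t[2] for t in path[1:])


def simulate(seq, noise):
    """Toy current trace: one noisy level per overlapping triplet of seq."""
    return [TOY_LEVEL[seq[i:i + 3]] + n
            for i, n in zip(range(len(seq) - 2), noise)]


trace = simulate("GATTACA", [0.1, -0.1, 0.2, -0.2, 0.1])
print(path_to_bases(viterbi(trace)))  # GATTACA
```

The point of the overlap constraint is that neighboring triplets share two bases, so a reading that is ambiguous on its own can be resolved by its neighbors. In a real pore the triplet levels would overlap heavily, which is presumably where the "spread of triplet currents" error mentioned above comes from.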

Andy Watson from RainDance provided more details about the oil-aqueous droplet system (1 B droplets per day) and its new application for digital PCR (on a new system). For the existing NGS amplicon prep system, a primer design pipeline is available, with > 800 designs done to date. A cancer hotspot panel with > 13,000 COSMIC mutations across 42 genes is available for FFPE samples and has been validated on the MiSeq system and the Ion Torrent PGM.

Elaine Mardis from Washington University talked about the NuGEN Mondrian digital microfluidics system, which is apparently as good as Sean McGrath, their sample prep maestro, at producing robust libraries for Illumina sequencing from vanishingly small amounts of DNA (5-10 ng). They used human buffy coat DNA and Enterococcus faecalis DNA to prepare libraries with different kits, which were then sequenced on the MiSeq. Data was comparable between the Mondrian and the manual kits.

Well, there were quite a few more talks in a very busy day, which I will type up once I am back home. I still have time to check out a few more posters … before the evening sessions start.