Blog @ Illumina
Real scientists. Real commentary.

Utilization of New DNA Sequencing Technologies: An AGBT Tutorial Workshop, Part 1

Abizar Lakdawalla, Ph.D.
| Feb 16, 2012

Beautiful weather, incredible science, what a combination.  I only wish the presentations were held on the beach -- but guess you can't have it all!

My take on the presentations should be useful to folks who were on the beach or not at AGBT, or who are actually at the #notAGBT meeting, where apparently some compelling technologies have been launched…

Here are some highlights from the first series of AGBT Tutorial presentations.

Mark Adams, J Craig Venter Institute
Took us on a tour of the existing NGS platforms. He started off describing the first NGS box, the 454. It still offers high-accuracy long reads, though according to Mark, the bead prep is messy and difficult to automate, and calling homopolymers can be problematic. He acknowledged that Illumina has the largest installed instrument base, and that 90% of all sequencing data is currently generated on Illumina sequencers. The relatively long run times on HiSeq matter to some people, though they may not be that important to others. He also talked about the SOLiD 5500, and about PacBio, which is getting attention because of its true single-molecule long reads and its ability to read modified bases directly.

For Complete Genomics (CGI), he referenced figures from the publication that compares CGI and Illumina human sequencing data. Calls absent on CGI were due to lack of coverage in those areas by CGI, and vice versa. Both platforms seemed to undercall indels, with poor concordance between the calls (my comment: this might be due more to the calling algorithms used than to the data). The talk ended with a discussion of what matters most when considering NGS, coming down in favor of cost, accuracy, and turnaround time.

Danielle Perrin, Broad Institute
Gave an impressive presentation on their high-throughput process development. She showed a jaw-dropping graph with the number of samples that Broad had done: 15,000 whole exomes, ~3,000 whole genomes, and ~12,000 custom pull-downs. For whole genome sequencing, they use 100 ng of input and perform sizing with the Sage Pippin Prep, which gives them libraries with up to 3 billion unique molecules each. They have five people performing library construction, and four people who do the hybrid capture and related QC on the thousands of samples. She talked about the concept of a pooling penalty, defined as the mean number of reads per library in a pool divided by the lowest number of reads for any library in that pool, which I took to mean how many reads a library took in excess of the rest of the crowd.
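To make the pooling penalty concrete, here is a minimal sketch of the ratio as I understood it from the talk (mean reads per library divided by the lowest read count in the pool). The function name and the read counts are made up for illustration; this is not Broad's code or data.

```python
from statistics import mean


def pooling_penalty(read_counts):
    """Mean reads per library divided by the lowest read count in the pool.

    `read_counts` maps a (hypothetical) library name to its number of reads.
    A perfectly balanced pool gives 1.0; larger values mean the weakest
    library is lagging further behind the pool average.
    """
    if not read_counts:
        raise ValueError("the pool must contain at least one library")
    lowest = min(read_counts.values())
    if lowest <= 0:
        raise ValueError("every library needs at least one read")
    return mean(read_counts.values()) / lowest


# Illustrative numbers only, not taken from the presentation.
example_pool = {
    "lib_A": 12_000_000,
    "lib_B": 10_000_000,
    "lib_C": 8_000_000,
}
print(pooling_penalty(example_pool))
```

In this made-up pool the mean is 10 million reads and the weakest library has 8 million, so the whole pool would need about 25% more sequencing for that library to reach the average.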

What was interesting for me was the way they are using MiSeq as a QC tool: apparently every pool is QC'ed on the MiSeq, with very good predictive agreement between MiSeq and HiSeq. The other surprise was that 51 HiSeq instruments are operated by just six staff. Hope they are not breaking any labor laws.

Thomas Keane, Sanger Institute
The focus of this talk was on the informatics challenges associated with large-scale resequencing. He started off with a graph that had the obligatory steep slope showing total sequence from 2007-2009, the period when informatics doomsayers had said that analysis could not keep up. The next graph, after the launch of HiSeq, showed an even steeper trajectory! Their Vertebrate Research Informatics group has gained a lot of experience in the terabytes-of-data world from several phases of the 1000 Genomes Project. The Sanger Institute has started on the UK10K project, with ~4,000 whole genomes and 600 exomes, including samples from 2,000 twins. The plan is to produce 100 Tb, 40 Tb of which will be from BGI. They get a shipment of disks from BGI every other week, and are on track for data collection to be completed by the first half of this year!

Thomas went over the process of data analysis and how data from different systems and collaborators are merged and re-merged across multiple samples into cross-sample BAM files. The variant calling pipeline is based on samtools, GATK, VQSR, and Beagle, with VEP annotation used to produce the final variant call file. Pause to let head stop spinning. Data management seems to be addressed well with iRODS, a rule-oriented data management system. It is open source, with origins in the particle physics world, and has a rule engine, akin to source control, plus application-level metadata (run, lane, plex, sample, library) that is searchable on individual terms. The talk ended with a discussion of a potential file format based on CRAM, offering a sliding scale from no compression, to lossless, to lossy compression, while still being able to keep pairing information, unmapped reads, and arbitrary tags.
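For readers who have not worked with this kind of setup, the idea of application-level metadata that is searchable on individual terms can be sketched in a few lines. This is a plain in-memory illustration using the five fields mentioned in the talk; it does not use the iRODS API, and the class name, function name, and file records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """Metadata attached to one sequencing data file (fields from the talk)."""
    run: int
    lane: int
    plex: int
    sample: str
    library: str


def search(records, **terms):
    """Return the records whose metadata matches every given term.

    Each keyword must be one of the FileMetadata field names, for example
    search(records, sample="S1") or search(records, run=7001, lane=2).
    """
    return [
        record
        for record in records
        if all(getattr(record, field) == value for field, value in terms.items())
    ]


# Made-up records for illustration only.
records = [
    FileMetadata(run=7001, lane=1, plex=1, sample="S1", library="libA"),
    FileMetadata(run=7001, lane=1, plex=2, sample="S2", library="libB"),
    FileMetadata(run=7002, lane=2, plex=1, sample="S1", library="libC"),
]

for hit in search(records, sample="S1"):
    print(hit)
```

A real deployment would run such queries against the metadata catalog rather than a Python list, but the principle is the same: each file carries a small set of key-value terms, and any combination of them can be used to find it.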

Continued in Part 2