Blog @ Illumina
Real scientists. Real commentary.

Hello, Moleculo

Amy Cullinan, Ph.D.
| Feb 11, 2013

Earlier this year, Illumina announced the acquisition of an unusually small and somewhat obscure technology startup with a 1960’s-futuristic kind of name. Google searches not only produced a few chuckles at talk show host Conan O’Brien’s 2001 skit, but also generated excitement about long synthetic reads overcoming certain technical hurdles in next-generation sequencing, such as repetitive elements and resolution of highly polyploid genomes.

Founded in 2012, Moleculo is a spin-off from Dr. Steve Quake’s lab at Stanford University. At the technology’s core is an innovative sample preparation method that breaks input DNA into large fragments, moves these fragments through a clonal amplification step, and then shears and tags them with a special barcode before sequencing by synthesis on Illumina instruments. Using more than a few proprietary informatics tricks, the tagged short reads are then reassembled into long, highly accurate “synthetic reads.” In essence, Moleculo methods transform short-read data from existing Illumina sequencers into long reads with an unprecedented combination of read length, accuracy, and throughput.
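
To make the barcode idea concrete, here is a toy Python sketch. It is not Moleculo’s proprietary pipeline: the function names, the (barcode, sequence) input format, and the greedy exact-overlap merge are all assumptions made for illustration. It only shows the general principle that reads sharing a barcode can be pooled and pieced back together separately from everything else in the run.

```python
from collections import defaultdict


def group_reads_by_barcode(tagged_reads):
    """Bucket (barcode, sequence) pairs so each bucket holds reads from one long fragment."""
    groups = defaultdict(list)
    for barcode, sequence in tagged_reads:
        groups[barcode].append(sequence)
    return dict(groups)


def merge_with_overlap(left, right, min_overlap=3):
    """Join right onto left if a suffix of left exactly matches a prefix of right."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


def toy_assemble(reads, min_overlap=3):
    """Greedy, error-free toy assembly (an assumption, not the real algorithm)."""
    remaining = list(reads)
    contig = remaining.pop(0)
    extended = True
    while extended and remaining:
        extended = False
        for i, read in enumerate(remaining):
            # Try to extend the contig to the right, then to the left.
            merged = (merge_with_overlap(contig, read, min_overlap)
                      or merge_with_overlap(read, contig, min_overlap))
            if merged is not None:
                contig = merged
                remaining.pop(i)
                extended = True
                break
    return contig


# Hypothetical barcoded short reads from two different long fragments.
tagged = [
    ("BC1", "ACGTAC"), ("BC2", "TTGACC"), ("BC1", "TACGGA"),
    ("BC2", "ACCGTT"), ("BC1", "GGATTC"),
]
groups = group_reads_by_barcode(tagged)
synthetic_reads = {bc: toy_assemble(reads) for bc, reads in groups.items()}
print(synthetic_reads)
```

Real data would need error tolerance, coverage-aware assembly, and far longer reads; the point here is only that the barcode keeps each fragment’s short reads together.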


The Long and Short of It

Generating long sequencing reads by next-generation sequencing isn’t particularly new, but has previously been achieved by sacrificing accuracy and coverage depth. The prospect of translating the high coverage and accuracy of Illumina’s short-read technology into virtual long reads opens up some interesting research doors.

One well-known long-read application is de novo sequencing: the construction of a genome for which no reference sequence (or only an incomplete reference) exists. Highly repetitive regions pose a major problem for the assembly of complex genomes. In most cases, a read that is ~200 bp (base pairs) long is not unique enough to map to a specific area and cannot assemble into a distinct, mappable fragment. Long reads, typically 2-5 kb (kilobases) in length, are needed to span these repeat regions. Since current NGS platforms produce either small numbers of long reads or large numbers of shorter, paired-end reads, de novo projects often use multiple technologies and align shorter reads to long-read scaffolds. As reported at the Plant and Animal Genome conference in January, Moleculo methods used in combination with HiSeq systems demonstrated the utility of long, accurate DNA reads in bridging repetitive regions of the blue catfish genome.
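
The repeat problem is easy to see on a toy scale. In the Python sketch below, the reference string, the repeat, and the helper name are all made up for illustration: a short read that falls entirely inside a repeated element matches two places, while a longer read that reaches into the unique flanking sequence matches only one.

```python
def find_positions(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference (an assumption): unique flanks around two copies of one repeat.
REPEAT = "GATTACA"
reference = "ACGT" + REPEAT + "GGCC" + REPEAT + "TTAA"

short_read = REPEAT           # lies entirely inside the repeat
long_read = reference[2:14]   # spans the first repeat plus unique flanks

print(find_positions(reference, short_read))  # two candidate locations
print(find_positions(reference, long_read))   # one location
```

Real genomes swap the 7-base toy repeat for elements hundreds or thousands of bases long, which is why reads in the kilobase range matter.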


The Next Phase

In genome-speak, “phasing” refers to resolving the unique content of each of the two homologous chromosomes, one inherited from each parent, and most common sequencing methods do not capture or preserve this information. However, recent findings suggest that relationships between genotype and phenotype can be better understood within the context of phase. Because a Moleculo long read can span more than one heterozygous SNP, a phasing algorithm can “stitch” multiple long reads into a single haplotype. Examining the variants uniquely positioned on each homologous genomic region may contribute to a greater understanding of gene function in deregulation or disease states.
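
As a rough illustration of what “stitching” means, here is a minimal Python sketch. It is an assumption-laden toy, not Moleculo’s phasing algorithm: each long read is reduced to a hypothetical dictionary of SNP position to observed allele, reads are assumed error-free, and merging is greedy in input order.

```python
def stitch_haplotypes(reads):
    """Greedily merge reads that share at least one SNP and agree on every shared SNP.

    Each read is a dict mapping SNP position -> allele (a made-up representation).
    """
    haplotypes = []
    for read in reads:
        for hap in haplotypes:
            shared = hap.keys() & read.keys()
            if shared and all(hap[pos] == read[pos] for pos in shared):
                hap.update(read)  # extend this haplotype with the read's SNPs
                break
        else:
            # No compatible haplotype shares a SNP with this read: start a new one.
            haplotypes.append(dict(read))
    return haplotypes


# Hypothetical long reads, each covering two heterozygous SNPs.
reads = [
    {100: "A", 250: "G"},
    {100: "C", 250: "T"},
    {250: "G", 400: "T"},
    {250: "T", 400: "C"},
    {400: "T", 610: "A"},
]
for haplotype in stitch_haplotypes(reads):
    print(sorted(haplotype.items()))
```

A read covering only one heterozygous SNP could never link two sites, which is why read length matters here. A production phaser would also have to tolerate sequencing errors and join blocks that only become connected by a later read.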

In addition, some genetic disorders depend on whether two disrupted alleles sit in the “cis” position (located near each other on the same chromosome) or the “trans” position (located on different homologous chromosomes). Using long synthetic reads, Moleculo’s algorithm can accurately phase the vast majority of human genes end-to-end.
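
Once two haplotypes are in hand, telling cis from trans becomes a simple lookup. The Python sketch below assumes a hypothetical representation (each haplotype as a dictionary of position to allele, each variant as a position and alternate-allele pair) and is meant only to show the distinction, not any actual Illumina or Moleculo software.

```python
def classify_configuration(hap_a, hap_b, variant_1, variant_2):
    """Return 'cis', 'trans', or 'undetermined' for two (position, allele) variants.

    hap_a and hap_b are dicts of position -> allele for the two phased
    haplotypes (a made-up representation for illustration).
    """
    def carries(hap, variant):
        position, allele = variant
        return hap.get(position) == allele

    a1, a2 = carries(hap_a, variant_1), carries(hap_a, variant_2)
    b1, b2 = carries(hap_b, variant_1), carries(hap_b, variant_2)

    if (a1 and a2) or (b1 and b2):
        return "cis"    # both variants on the same chromosome copy
    if (a1 and b2) or (b1 and a2):
        return "trans"  # one variant on each chromosome copy
    return "undetermined"


# Hypothetical phased haplotypes covering two sites in one gene.
hap_a = {100: "T", 900: "G"}
hap_b = {100: "C", 900: "A"}

print(classify_configuration(hap_a, hap_b, (100, "T"), (900, "G")))
print(classify_configuration(hap_a, hap_b, (100, "T"), (900, "A")))
```

Without phase information, both cases would look identical: heterozygous at both sites.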


Complexity, Clarified

Humans are diploid (2n) creatures who like to think they possess complex genomes, but in reality, staple crops like oats and sugarcane elevate genome complexity to a whole new level through polyploidy. Polyploidy is a change in the number of complete chromosome sets, and these extra sets are likely derived from ancient duplication events. Variable polyploidy can also occur under certain conditions when DNA replication outpaces cell division, as in the exponential growth phase of bacteria. Since Moleculo is a single-molecule technology with every long read originating from a single haplotype, these synthetic long-read methods may be very useful in routine sequencing of complex polyploid and highly heterozygous samples.


Stay Tuned

Some early‐access collaborators are already using this technology for de novo assembly of non-model organisms, metagenomics sequencing projects, detailed analysis of unstable cancer genomes, and distinction between functional genes and closely related pseudogenes. The best is yet to come, so stick around for updates on this exciting complementary technology.