Over the last two years since our first introduction to Third Generation Sequencing (TGS) platforms (found here), a lot has changed. De novo genome assemblies are becoming more affordable, technologies are improving and maturing, and chromosome-level assemblies are no longer uncommon. Alongside the technical advances, large genome consortia are starting to make significant progress: the Vertebrate Genome Project has completed genomes for 48 birds spanning all orders (through the subgroup Avian Phylogenomics Project, website here) and developed a pipeline (seen here) to work through the rest of the orders.
While improvement is great, it can make planning a project a bit difficult. Sometimes it’s hard to know which version of the technology was used (this is particularly bad for Bionano technologies) and what to expect from current offerings. So, today, in honor of DNA Day, we’re going to update you on all the major players, update our comparison/price table, and talk about important considerations while planning a project in 2019!
TGS Technologies Introduction and Update
Extraction of High Molecular Weight DNA
Genome Specific Considerations
TGS Technologies Introduction and Update
There are three main long-read sequencing technologies and two scaffolding technologies that are the real drivers of chromosome-level assemblies.
PacBio (website) is one of the most established long-read technologies, having been around since 2010. The technology uses single-molecule real-time (SMRT) sequencing, in which hairpin adaptors on a double-stranded DNA fragment create a single-stranded circular DNA template. This template is then loaded onto the SMRTcell chip (they REALLY like this acronym), where sequencing-by-synthesis occurs and the light pulses emitted during incorporation of each nucleotide are recorded in 0.5-4h long movies. You may see this method referred to as CLR, or continuous long-read sequencing.
Since the template is circular, the polymerase can make multiple passes over the strand, which can then be split at the adaptor sequences. The more passes, the more accurate the base calls. On currently available technology (Sequel 6.0), it takes ~4 passes to get QV20, and 10 to get QV30. The use of these passes to increase quality is referred to as CCS (circular consensus sequencing). Returned PacBio reads often include different sets – subreads (all complete reads from each molecule), long_reads (reads that didn’t make it through the entire molecule and therefore lack subreads for internal correction), and CCS (corrected consensus reads of all the subreads).
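For context, QV is a Phred-scaled error rate, so the quality levels mentioned above translate directly into expected error rates. A quick sketch of the conversion (generic math, not PacBio-specific software):

```python
import math

def qv_from_error(error_rate):
    """Phred-style quality value: QV = -10 * log10(error rate)."""
    return -10 * math.log10(error_rate)

def error_from_qv(qv):
    """Expected per-base error rate for a given quality value."""
    return 10 ** (-qv / 10)

# QV20 means 1 error in 100 bases; QV30 means 1 in 1,000.
print(error_from_qv(20), error_from_qv(30))
```

So the ~4-pass QV20 threshold corresponds to 99% per-base consensus accuracy, and the 10-pass QV30 threshold to 99.9%.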
The most recent development in PacBio is the Sequel II System, which was released April 24, 2019 (see release here). The original Sequel (currently v6.0) has seen major improvements in read length (now 30-100kb on average, up from ~5kb), base quality, and indel identification. Sequel II is the new machine that keeps these chemistry improvements but adds roughly 8-fold more output. Sequel II is producing just under 100Gb of data (as opposed to 15Gb on the original Sequel) and 4-5 million reads (up from ~500,000), with quicker run times, all while maintaining the same quality (see this video for more details). Haplotype phasing is also available, see here for more details!
PacBio terms to know:
- SMRT – single molecule real time, PacBio’s sequencing technology
- CLR – continuous long-read, the nature of SMRT wherein the circularized DNA is resequenced numerous times to create subreads
- subreads – reads from CLR that are all from the same molecule, collapsed to perform internal error correction and increase quality. Collapsed subreads form the CCS.
- CCS – circular consensus sequence, the high-quality read produced from several subreads of the same molecule of DNA.
- more here!
Oxford Nanopore (website here) works by feeding a strand of DNA through a protein nanopore embedded in an electrically resistant membrane. The sequencer reads the disruption in the current passed through the membrane as each base passes through. Since the bases have different electronic characteristics, each nucleotide has a different signature of disruption, allowing for base calls. Machine types (Flongle, MinION, GridION, and PromethION – see here) vary mainly in the size of the sensor array and the number of pores. The Flongle is a portable device for small, rapid testing of amplicons or targeted sequencing; the MinION is a single-flow-cell sequencer that connects to your laptop and can handle all the basic sequencing projects; the GridION is for higher throughput (5 flow cells); and the PromethION is for very high throughput (48 flow cells).
One of the more interesting advancements in this technology is the development of nCATS (nanopore Cas9 Targeted-Sequencing), which allows for targeted sequencing with long reads. Cas9 is used to cut the DNA at specified locations and ligate adaptors to the cut ends; with this approach, 165X coverage of 10 loci with a median read length of 18kb was achieved on a single flow cell (and is replicable on the Flongle or MinION!). This is excellent for follow-up work on a genome, such as methylation patterns or structural variants. More details can be found here.
Oxford Nanopore Terms to know:
- MinION – small scale desktop sequencer that is highly popular, runs a single flow-cell
- PromethION – new large scale sequencer that runs up to 48 flow cells
- nCATS – nanopore Cas9 Targeted-Sequencing, allows for sequencing of only targeted sequences by leveraging Cas9.
10X Chromium (website) is a kind of synthetic long-read technology (similar to Illumina’s former long read tech) that uses barcodes to group and assemble short reads from a single large template. Basically, it is breaking the assembly problem up into a bunch of small, local assemblies (each for a single long molecule) before combining them into a full assembly. A small amount of high quality DNA is separated into droplets, each with ~10 molecules of 100kb DNA. Each droplet is individually barcoded and amplified through emulsion PCR, producing short fragments of the 10 molecules in each droplet. These fragments are sequenced to ~56X with Illumina HiSeq, and post-processed into bins of similar sequences across barcodes. Each individual molecule has low coverage (~0.2X) per droplet, but combining the ~150 molecules covering each stretch of the genome, the resulting sequence coverage is ~30X over up to 200kb fragments. Since the reads from each molecule are barcoded and therefore linked, haplotype information can be obtained over large genomic ranges.
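The coverage arithmetic in that description is worth making explicit (all numbers are the approximate figures above, not official 10X specifications):

```python
# Back-of-envelope linked-read coverage from the figures above.
per_molecule_coverage = 0.2   # each long molecule sequenced to ~0.2X per droplet
molecules_per_locus = 150     # ~150 molecules span any given stretch of genome
effective_coverage = per_molecule_coverage * molecules_per_locus
print(round(effective_coverage))  # ~30X effective coverage
```

Per-molecule coverage is shallow, but because so many barcoded molecules overlap each locus, the pooled coverage is deep while the barcodes preserve the long-range linkage.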
Using this information, 10X was the first to offer phased haplotypes in long-read sequencing. The resulting assembly is not fully phased, but provides phased “megabubbles” in the assembly using the above local assembly of short reads. More info can be found on their site, with well done videos here.
Chromium sequencing sets itself apart by ease of use: only one library, made from 1ng of DNA, is needed, and Illumina sequencing underlies the technology, making any biases familiar to most researchers. 10X’s software, Supernova, was also recently updated, almost doubling the continuously phased regions (phaseblocks) produced for several genomes, decreasing the error rate, and decreasing misassembly rates.
10X is actually pretty straightforward in its terminology. Probably part of their interest in being easy to use!
Image from 10X website
Bionano Saphyr (website and white paper) is a unique departure from other sequencing technologies in that 1) it harnesses original gel capillary methods from the Sanger days and aims for low resolution – about 1kb; and 2) it’s a scaffolding technology, not a sequencing technology. The goals of using this technology are typically to either provide a low-resolution, long-range scaffold for other sequencing or to identify large structural variations that can be missed by shorter-range sequencing. For instance, it has 99% sensitivity for 30Mb inversions and a 97% sensitivity for copy number variants larger than 500Mb — something you aren’t going to find with other technologies. More details can be found here.
Bionano has been going through a lot of evolution, meaning it can be hard to know whether older performance reviews reflect what to expect from the current technology. At its heart, Bionano relies on labeling specific short motifs that occur approximately every ~1kb throughout the genome. The long, intermittently labeled fragments are fed into a chip, where the sequences flow through gel micro-grooves and (admittedly beautiful) photos are taken of the fragments as they pass through the chip. These images are then analyzed for overlap, and long-range scaffolds are made, with intermittent sequence information present to allow for scaffolding of sequencing done with one of the above technologies. Originally, Bionano leveraged restriction sites as its small motifs, making this method conceptually equivalent to restriction mapping and optical mapping, but with very high throughput.
Here’s where the confusion can lie: Bionano Saphyr is the successor to Bionano’s Irys platform and came out in February of 2017. Saphyr improved the scaffolding ability of Irys by about ten-fold and allows for phasing (see here for more info). This is important to know, as many papers leave off which Bionano technology was used in “bake off” comparisons, etc. So, if you see what looks to be a poor performance of Bionano in a paper or project, double check whether it used the old technology or was done before February of 2017. It makes a world of difference, especially if you are trying to make economical decisions (Bionano is cheaper than Hi-C).
To make matters a bit more confusing, Bionano Saphyr has two major prep kits. The original tactic for labeling the DNA involved nicking the sample DNA, labeling it with fluorescent nucleotides, repairing the nicks, and then staining the DNA so that it can be read. This technique is called NLRS, for “nick, label, repair, and stain” (protocol information here). The newer kit, introduced this year and also run on the Saphyr, is a direct label and stain (DLS) technique, which labels specific motifs directly (without nicking) and is then imaged the same way as NLRS (protocol information here). DLS allows for >2Mbp molecules in the imaging step and has led to ~50-fold increases in map length.
Bionano terms to know:
- Irys – old technology that had lower performance than the current version, Saphyr
- Saphyr – a newer version that came out in February of 2017, increasing quality and size significantly
- NLRS – original prep method of nicking, labeling, repairing, and staining DNA to get imaging, recently replaced by DLS.
- DLS – new direct label and stain prep that recovers longer molecules and increases map length.
Image from bionano website
The other major scaffolding protocol is Hi-C, which leverages spatial information in chromatin to produce map information (among other things). Hi-C takes intact cells and cross-links chromatin segments that are physically near each other. The chromatin is then cut with restriction enzymes, the ends of the cross-linked sections are marked with biotin and ligated together, and the remaining DNA is sheared. Biotin-tagged fragments (where ligation occurred) are pulled down and sequenced on Illumina. Because chromatin that is spatially close is usually also close along the same chromosome, the likelihood that two sequences are cross-linked is a function of their physical distance: the more often two sequences co-occur in ligated fragments, the more likely they are to be near each other.
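The core signal that Hi-C scaffolders exploit, more shared contacts implies closer proximity, can be sketched with a toy example (real scaffolders such as SALSA or 3D-DNA are far more sophisticated; this greedy chaining and the contact counts are illustration only):

```python
def greedy_order(contacts, start):
    """Chain contigs by always following the strongest remaining Hi-C link.

    contacts: {(contig_a, contig_b): read-pair count}.
    """
    contigs = {c for pair in contacts for c in pair}
    order, current = [start], start
    while len(order) < len(contigs):
        best = max(
            (c for c in contigs if c not in order),
            key=lambda c: contacts.get((current, c), 0) + contacts.get((c, current), 0),
        )
        order.append(best)
        current = best
    return order

# Contigs A-B-C-D in true order: adjacent pairs share the most contacts.
counts = {("A", "B"): 90, ("B", "C"): 80, ("C", "D"): 85,
          ("A", "C"): 20, ("B", "D"): 15, ("A", "D"): 5}
print(greedy_order(counts, "A"))  # ['A', 'B', 'C', 'D']
```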
This technology also relies on other sequencing, as you are only sequencing the proximity ligation libraries. Additionally, this technology is using Illumina sequencing, so it has the same familiar low error rate and known biases (GC extremes and homopolymers are hard to cover). The technology is very amenable to acting as a scaffold for other sequencing technologies, or immediately improving older assemblies. Hi-C has increased genome N50s from 14Mb to 92Mb in the goat genome, 5.4Mb to 39Mb in the hummingbird, 1.4Mb to 179Mb in a rattlesnake and 5Mb to 41Mb in the Black Raspberry.
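N50, the statistic quoted above, is the length L such that contigs (or scaffolds) of at least length L cover half the assembly. A minimal implementation:

```python
def n50(lengths):
    """N50: length L such that pieces >= L contain at least
    half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Five contigs totalling 300bp; the 100bp and 80bp contigs together
# pass the halfway mark (180 >= 150), so N50 = 80.
print(n50([100, 80, 60, 40, 20]))  # 80
```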
There are two main companies that do Hi-C: Phase Genomics (website) and Dovetail Genomics (website). Both do Hi-C, but Phase is the only one that does microbiome work (see here) and Dovetail is the only one that offers Chicago sequencing. Chicago is very similar, except that the cross-linking is recreated artificially, so intact cells are not required.
Hi-C terms to know:
- Chicago – Dovetail Genomics’ version of Hi-C that doesn’t require intact cells, as the chromatin folding is recreated artificially.
- TADs – topologically associating domains, regions of DNA that preferentially contact each other and may have functional significance. A current research topic that will leverage Hi-C technology. More information here.
Above image from Phase Genomics website
Above image from Dovetail Genomics website
Comparison Table for the Technologies
See table here.
Well, that was a lot to cover, and a lot to consider when designing your project. Thankfully, there are some large projects paving the way. We covered the Tegu Genome Paper (linked here) that was in GigaScience (linked here) recently, which is a good example of a genome paper. I’d recommend starting there! There is also the published Vertebrate Genome Project protocol (linked here), which I like because it follows what we have been recommending. However, there are some considerations to keep in mind, even when using another group’s workflow.
Extraction of High Molecular Weight DNA
“A lot of the standard methods to [extract DNA], like columns and magnetic beads, aren’t really that great for getting really big DNAs” – Kelvin Liu, Founder and CEO of Circulomics (via Genome Web)
The first step for any of these methods is getting the sample to start with. Long-read technology requires intact long strands of DNA with high purity, for obvious reasons. This can be a bit of an issue, especially if you need it done quickly and efficiently. I recommend reading a recent article on GenomeWeb (it is here, behind a sign-in, but the content is free if you are associated with a university). It covers efforts by Bionano, Circulomics (website), Sage Science (website), and RevoluGen (website) to solve this problem. Definitely worth a read. And remember, library construction can be a significant expense, especially if you pay for it as a service at a sequencing center (generally recommended!).
For the last two years, we have suggested layering short, long, and scaffolding technologies to get the best results (see this blog and GigaBlog). However, as technologies mature, there are more options. At PAG this year, we heard talks proclaiming the benefits of doing PacBio/scaffolding genomes and Illumina/scaffolding genomes (note – scaffolding is highly recommended!). There are three camps in the Illumina/PacBio balance debate – all PacBio, all Illumina, and mixed. Let’s look at the considerations in each.
All PacBio is tempting since the quality is now on par with Illumina if you consider CCS (review vocabulary here). CCS generates ~2.3Gb per SMRTCell with ~13.5kb read length and costs ~$1200 per SMRTCell (prices are taken from here). A recent paper used ~39 SMRTCells to generate 89Gb of human genome CCS for coverage of 28x. This assembled nicely to an N50 of 15Mb and a concordance of 99.998%. It is clear that you can produce a genome with just PacBio, which isn’t susceptible to indel issues (once subreads are internally corrected to form CCS, see here) or GC coverage bias. However, those benefits come at a cost: Illumina NextSeq generates 120Gb of data for the price of four SMRTCells (~9.2Gb CCS). The cost of PacBio is likely to continue to improve, and it’s not yet clear whether the non-CCS reads (the long_read file) are useful to the assembly, which would add value. So, if you have the cash (or a smaller genome) and the computational resources (CCS processing takes ~3000 CPU hours per SMRTCell), there are attractive benefits here. Note: These numbers are all from version 3.0 of the chemistry and version 6.0 of the software (see here for the paper). The new Sequel II will likely alter these numbers further, as it has an 8x increase in throughput, which is likely to reduce the cost per Gb.
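To put those prices in perspective, the per-Gb arithmetic from the figures above works out roughly as follows (back-of-envelope only; prices change quickly):

```python
# Rough cost-per-Gb comparison using the figures quoted in this paragraph.
smrtcell_cost = 1200          # dollars per SMRTCell (CCS mode)
ccs_yield_per_cell = 2.3      # Gb of CCS data per SMRTCell
pacbio_cost_per_gb = smrtcell_cost / ccs_yield_per_cell

illumina_run_cost = 4 * smrtcell_cost   # "price of four SMRTCells"
illumina_yield = 120                    # Gb per NextSeq run
illumina_cost_per_gb = illumina_run_cost / illumina_yield

print(round(pacbio_cost_per_gb), round(illumina_cost_per_gb))  # ~522 vs ~40
```

In other words, at these prices CCS data runs roughly an order of magnitude more per Gb than Illumina, which is the crux of the trade-off discussed here.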
High-quality all-Illumina genomes are less common (though commonly attempted!), but can be done in conjunction with new scaffolding technologies. With the ability to produce 120Gb of 150bp paired reads for ~$5K (prices taken from here), Illumina is still the cheapest means of getting high coverage and high accuracy (details here). Some genome projects are increasing coverage and foregoing PacBio (see the Northern White Rhino commentary here). Some scaffolding technology is important here, so that reads can be assembled into larger scaffolds. Bionano is cheaper than Hi-C, which is attractive if you are avoiding PacBio for cost reasons. Hi-C is also based on Illumina reads, meaning it will have the same biases as the initial coverage, but it does add more Illumina sequencing to the assembly, which may help. Since this is an Illumina-based approach, I would avoid it for any enormous genome or one with extreme GC content, as biases and necessary coverage are likely to become a factor.
The third group is still the most common. There is a benefit to PacBio long reads and Illumina short reads, and they have different biases. However, there are a lot of ways to combine these technologies. You can go with light PacBio sequencing, deeper Illumina sequencing, and use MaSuRCA (see here) or MARVEL (see here) to scaffold the Illumina against the PacBio long reads (as in this paper on the sea pansy and this paper on the axolotl salamander). You can go with more balance, using enough PacBio to give you longer scaffolds, Illumina to give you more coverage, and software such as proovread (see here) to correct indels in the PacBio with short reads (as in the Tegu paper). You can go with Canu’s trio binning (see here), using Illumina just to identify k-mer patterns in each parent genome and heavier PacBio to sequence an F1 hybrid of the two. There are a lot of options here to help balance between the two, but the considerations are going to be cost and making sure you can correct errors in the PacBio (via short-read correction, which is going out of favor, or enough depth of PacBio to use CCS reads only).
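The trio-binning idea can be illustrated with a toy sketch: k-mers found only in one parent’s reads assign each F1 read to a haplotype. (Canu’s actual implementation builds whole-genome k-mer databases from the parental Illumina data; the sequences and k size here are made up for illustration.)

```python
def kmers(seq, k=5):
    """All k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_read(read, mat_only, pat_only, k=5):
    """Assign a read to a haplotype by counting parent-specific k-mers."""
    ks = kmers(read, k)
    m, p = len(ks & mat_only), len(ks & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"

# Toy "parental" sequences differing at their 3' ends.
mat_only = kmers("ACGTACGTGG") - kmers("ACGTACGTTT")
pat_only = kmers("ACGTACGTTT") - kmers("ACGTACGTGG")
print(bin_read("TACGTGG", mat_only, pat_only))  # maternal
```

Once reads are binned, each haplotype is assembled separately, which is why trio binning yields fully haplotype-resolved assemblies rather than a collapsed mosaic.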
These are the most common camps we have come across, though I’m sure there is an argument for 10X or Nanopore only sequencing as well. No matter what you choose, because this is still a matter of debate, be prepared to defend your error handling in your publication!
Prior Data – Use What You Have
Illumina sequencing has had the largest market share of sequencing for many years and has been used to produce many genomes, though generally not of the same quality as those now being produced. All of that data is still very useful in synergy with the new long-read technology. Illumina sequencing is still the most accurate raw output of the current technologies and relatively cheap per bp (though PacBio is catching up!). It’s so economical and accurate that 10X and Hi-C both leverage it in their technologies. Hybrid assemblers (Canu, MaSuRCA, etc.) allow for combining data inputs to reduce the cost of de novo assembly. Supplementing PacBio or Nanopore reads with Illumina short reads reduces the long-read depth needed to achieve reasonable accuracy while still spanning long gaps. Supplementing older assemblies with either scaffolding technology has achieved considerable improvement of draft genomes. Some argue that old drafts only need scaffolding to get to reference quality, but see the above considerations on Illumina/long-read balance.
Genome Specific Considerations
There are many different considerations for specific genome characteristics—such as size, heterozygosity, and GC content. See our companion post on GigaBlog (linked here) for a discussion of these genome-based considerations!
When budgeting, keep in mind that informatics is now a large part of the time/money for a genome (see here for a breakdown). I’ve said it over and over again: when you hear about “doing a genome for $10k”, this does not include personnel time. These technologies are always upgrading, and working through the kinks will take longer than you expect. Even 10X, with its very easy-to-manage workflow, has some difficult software to install. Also, consider that each sequencing platform comes with its own software, sometimes a lot of it (PacBio’s SMRT toolkit, for example).
For example, 23 different software packages were used just to assemble the genome/transcriptome for the Tegu paper (linked here). That number jumps to more than 44 packages needed when you include downstream analysis such as annotation. And that was for a genome without any major quirks.
Planning for an analysis budget as well as a detailed analysis workflow from the get-go will help keep these costs from ballooning. Running tests on smaller, publicly-available data sets will help you get familiar with the biases and complications of the software before investing in sequencing that may not be sufficient. Proper planning of the software and resources needed will save you from having to pay a grad student to figure out six different software packages that are all dead ends!
This is the part we at NCGAS strive to help with – training people to plan and do these analyses, providing software support to save you from tearing your hair out trying to install dozens of packages, and providing or pointing you toward free-to-use machines capable of completing the demanding analyses. We have the software packages for SMRT and Chromium installed already, as well as many other assemblers. We can also help with others as needed!
Bionano (see here), PacBio (through trio), 10X, Nanopore (see here) all offer phasing support. With the increased availability of this information, it may be worth looking into the utility of such information to see if it would help in any of the analyses you have planned (it really is case by case).
10X was the first to offer phasing, and therefore has some awesome videos (see the two on linked reads here) about the utility of this information, which might serve to give you ideas!
Methylome Sequencing—more soon, but it is worth looking at this paper (under the benefits of the MinION section) and this page, and this paper and this page. These links discuss how Nanopore and PacBio sequencing, respectively, can detect methylation natively (without bisulfite prep)!
RNAseq—We’ll be covering this soon!