Our last blog on third generation sequencing is still mostly relevant and current (here), so this post is an update where improvements are noted. The methods have matured a lot. The players are still largely the same: PacBio, Nanopore, Bionano and Hi-C. 10x has dropped out of the genomics field to focus almost exclusively on single cell.
This leaves us with two sequencing methods and two scaffolding methods that can be mixed and matched in any of the four possible combinations. Why do we include scaffolding in a blog about third generation sequencing? Scaffolding is required to get a really good (reference quality) assembly, with possibly chromosome length scaffolds. What I see most often is PacBio paired with Hi-C, and that is what I would first suggest. But Nanopore has attractive features—look at pricing, and what your local center offers, if you have one. Bionano can be a cheaper alternative to Hi-C as well, if your main goal is a sequenced genome and you aren’t trying to get additional information. A table comparing the different methods with costs is (here).
An advantage of these long read sequencing methods is that they can separate many regions into two haplotypes (phasing), in the regions where there is enough differentiation to allow the two haplotypes to be assembled separately. By requiring higher identity in assembling scaffolds, these haplotypes assemble as bubbles in the genome assembly, allowing large sections of the genome to have full haplotypes. Haplotype-level assembly has changed quite a bit through the years we have been doing these reviews. At first 10x carved a niche by offering the only haplotyped sequencing, but they aren’t in the game anymore. Hi-C is useful for phasing. Alternatively, you can get by with pretty standard long-distance sequencing to get haplotype sequences for diverse regions of your genome, or you can use the trio method of experimental design to get a full assembly of each haplotype (here is a paper that uses trio with PacBio). Here is a blog post that discusses how to use PacBio and Nanopore data to phase genomes (here) without trios; it is a little wonky but I like it a lot.
Note that for some of these methods the DNA has to be really intact; I haven’t seen anyone lysing cells in gel plugs, like I used to do for pulsed-field gels, but check with the center you are using. I have seen centers that will take your prepared DNA and run a pulsed-field gel analysis to see how intact it is, and for some centers it’s not clear to me how they are isolating the DNA.
PacBio Single Molecule, Real-Time (SMRT) Sequencing technology
“In fact, the new Sequel II actually produces data at a lower cost per Gb than our Illumina HiSeq 4000!”. Don’t get too excited, this is probably raw reads, not corrected reads—see below.
The HiFi method has become the favored method for SMRT sequencing, where each single template molecule is read multiple times, and these reads are used to self-correct. If there are 10 subreads per template, the accuracy of the consensus is reported to be 99.9%. These consensus reads are then assembled. Aside from HiFi, samples can be run under CLR mode, where linear templates (not circularized as in HiFi) are sequenced for as long as they can be, perhaps an average of 200Kb. Our impression is that PacBio is now more popular than Oxford Nanopore for genome assembly, but PacBio seems to be much better at trumpeting its results than Nanopore is, so its superiority may simply be the level of chatter.
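The self-correction idea can be illustrated with a toy majority vote across repeated passes of one template. This is only a sketch under strong assumptions: the subreads are taken to be already aligned and of equal length, errors are substitutions only, and the function name `majority_consensus` and the example sequences are made up for illustration. It is not PacBio's circular consensus algorithm.

```python
from collections import Counter


def majority_consensus(subreads):
    """Column-wise majority vote over equal-length, pre-aligned subreads.

    Toy model only: real consensus callers handle insertions, deletions
    and per-base quality, none of which is modeled here.
    """
    if not subreads:
        raise ValueError("need at least one subread")
    length = len(subreads[0])
    if any(len(read) != length for read in subreads):
        raise ValueError("toy example assumes equal-length, aligned subreads")

    consensus = []
    for column in zip(*subreads):
        # Pick the most frequent base observed at this position.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same template; three carry one error each.
passes = [
    "ACGTTGCA",
    "ACGTAGCA",  # substitution at position 4
    "ACCTTGCA",  # substitution at position 2
    "ACGTTGCA",
    "ACGTTGGA",  # substitution at position 6
]
print(majority_consensus(passes))
```

Because the errors fall at different positions in different passes, each one is outvoted by the other passes, which is the intuition behind why more subreads per template give a more accurate consensus.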
Prices for PacBio sequencing have come down a lot, but call a PacBio sales rep for a quote. You will want to compare this to Illumina short read methods (here), which won’t give the isoform data but will be fine for identifying genes and probable coding sequences. I always suggest getting quotes from sequencing centers for accurate pricing. Note, while many people go directly to PacBio, there are plenty of centers that now own PacBio sequencers. Ask if they have the newest Sequel IIe, or if they are working with an earlier model (the main difference between the II and IIe is in-machine computational power, not chemistry). I’m imagining that at PacBio, they are always using the newest model, but they list service providers on their site. Go here to get a quote from PacBio.
On 1 October 2019, PacBio released the 8.0 software and 2.0 chemistry for Sequel II. For larger templates read as “continuous long reads”, an example human library yielded an N50 read length of 52,456 bases and a yield per cell of 182 Gb. For libraries below ~20,000 bases, read in circular consensus sequencing, yield per cell is quoted at 450 Gb of raw reads, or about 30 Gb of HiFi corrected reads. See PacBio’s current pricing here.
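For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of how a read-length N50 is usually computed: sort the reads from longest to shortest and report the length at which the running total reaches half of all bases. The function name `read_n50` and the toy lengths are invented for this example.

```python
def read_n50(lengths):
    """Return the N50 of a list of read lengths.

    The N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running to total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy data: 32 bases in total, half of them in the two longest reads.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(read_n50(toy_lengths))
```

Note that the N50 is weighted by bases rather than by read count, so a few very long reads can pull it well above the median read length.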
They suggest 1000 CPU hours per sample for computation, so still pretty demanding; thus they added onboard computational power into the new IIe. The newer assembler hifiasm seems to be the preferred assembler for PacBio HiFi data, although Canu also sees use.
A fun example of HiFi use is the 27Gb California redwood genome (here), a part time hobby of PacBio personnel. This from that report: “As a general recommendation, 10- to 15-fold coverage in HiFi reads is the ideal range to yield a genome that measures up favorable in the 3 C’s of genome quality. Increasing the coverage to 33X significantly improved the assembly”, so 33X is the number I would use to calculate how much sequence you need. Remember that a single SMRT cell gives ~30 Gb of HiFi corrected reads, so they used up a lot of SMRT cells on the project, say 30 cells? But a 1Gbp genome would only need a single SMRT cell. There are rice genome projects that used 56.73 Gb (~150X) and 86.85 Gb (~230X coverage) of PacBio (but I bet that wasn’t cheap!). For reference, below is some of the output to be expected from each flavor of SMRT sequencing:
- Estimated 100-150 Gb for long-insert genomic CLR libraries (single or multiplexed)
- Estimated 300-500 Gb for mid-sized insert genomic HiFi libraries. Note: After performing CCS analysis on this genomic HiFi data, PacBio’s yield estimate is 20 Gb of HiFi data >Q20 per cell.
- Estimated 150-400 Gb for short-insert libraries (IsoSeq, amplicons, plasmid digests, capture pulldowns, etc.)
*I’m not sure what insert sizes these are. Since they often say 17-20 kb is best, I imagine that means mid-sized inserts.
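The coverage arithmetic from the redwood example can be written out as a small calculator. This is a back-of-the-envelope sketch: the function names are made up, and the 30 Gb of HiFi reads per SMRT cell is simply the figure quoted in this post (the list above also gives a 20 Gb estimate), so treat the per-cell yield as an assumption to replace with your center's numbers.

```python
import math


def hifi_gb_needed(genome_size_gb, coverage):
    """Gigabases of HiFi reads needed for a given fold coverage."""
    return genome_size_gb * coverage


def smrt_cells_needed(genome_size_gb, coverage, hifi_gb_per_cell=30.0):
    """Whole SMRT cells needed, rounding up to the next full cell.

    hifi_gb_per_cell is an assumed yield; 30 Gb is the figure used in
    this post, not a guaranteed specification.
    """
    needed = hifi_gb_needed(genome_size_gb, coverage)
    return math.ceil(needed / hifi_gb_per_cell)


# 27 Gb redwood genome at 33X coverage.
print(hifi_gb_needed(27, 33), "Gb of HiFi reads")
print(smrt_cells_needed(27, 33), "cells at 30 Gb per cell")
print(smrt_cells_needed(27, 33, hifi_gb_per_cell=20.0), "cells at 20 Gb per cell")

# A 1 Gb genome at 30X coverage.
print(smrt_cells_needed(1, 30), "cell at 30 Gb per cell")
```

The per-cell yield dominates the answer: moving from the 30 Gb figure to the more conservative 20 Gb estimate raises the cell count for the same genome by half.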
A few nice papers that used HiFi reads in a genome project are black soldier fly (here), tomato reference-quality assembly (here), and giant lungfish (here). Here is an assembly of HiFi sequence that doesn’t use scaffolding, given as a protocol to produce a plant genome (here).
Oxford Nanopore Technologies
Information on Nanopore seems harder to come by, compared to PacBio, but PacBio has been pretty aggressive in their outreach.
Nanopore data is generated by electrophoresing a molecule through a tiny pore and measuring the change in current for each base going through, which is different for each of the four bases (and different for modified bases as well, so modifications to nucleotides can be read directly without having to treat a sample with bisulfite or demethylating enzymes). Pretty ingenious, but with an error rate of 1-5% or 5-15%; I’ve found both ranges. The lower range of 1-5% is similar to PacBio before they developed HiFi. Much of this seems to depend on what base-calling algorithm is used, so you should get the best results using the right software (currently Bonito v0.3.6; see a report of the most recent release, 1st March 2021, here). The first platform was the MinION; this is the one that fits into the palm of your hand and is entirely portable. The GridION will run 5 MinION flow cells. The PromethION is their high capacity machine (24 cells).
Nanopore reads are much longer than PacBio reads: they can reach 330kbp in length, even exceeding 2Mb according to a report. Yield/cell is 245Gb. It can be used for both DNA and RNA (without reverse transcription), and it can read methylated bases (and other modifications) directly (read). Nanopore technology can now sequence the same molecule twice (both strands), improving its accuracy further, with reported accuracy of 95%. Interestingly, some of the improvement in accuracy is due to the refinement of the pore itself. One recent report is of 98.9–99.6% accuracy (here; for mRNA); this is good enough to do an assembly using only Nanopore data (with scaffolding). And I don’t see why they can’t take an approach like PacBio to sequence a template multiple times with rolling circle amplification, then self-correct. While it seems a little silly to me, given how “short” transcripts are, Nanopore is used for transcriptomes (here and here). And remember that Nanopore can read RNA directly, without an error prone RT step.
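The accuracy and error percentages quoted in this post can be put on the same scale as the ">Q20" figure in the PacBio yield list by using the standard Phred definition, Q = -10 log10(error probability). The sketch below applies that formula; the function names are made up for this example, and the accuracies in the loop are illustrative values in the ranges discussed here, not measurements.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred Q score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q):
    """Convert a Phred Q score back to per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# Illustrative accuracies spanning the ranges mentioned in the text.
for acc in (0.85, 0.95, 0.99, 0.999):
    print(f"{acc:.1%} accuracy is about Q{accuracy_to_phred(acc):.1f}")
```

On this scale each additional 10 Q points is a tenfold drop in error rate, so the gap between 95% and 99.9% accuracy is larger than the raw percentages suggest.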
A good opportunity to catch up on Nanopore technology is their upcoming London Calling conference.
Neither Hi-C nor Bionano is a sequencing method per se (Hi-C does use Illumina sequencing as part of the protocol); rather, they are ways to scaffold the assembly. Even long range sequence can only get contigs/scaffolds of a certain size, and both Hi-C and Bionano can stitch these together into very long scaffolds, even chromosome length scaffolds.
This is easiest to explain in a figure (see a nice one here, and see the Wikipedia entry). Sequences in proximity in 3-D space are fixed to hold them together (fixed via the proteins binding them), circularized, and these bits turned into an Illumina library. Hi-C was first developed to study chromosome structure. There are any number of genomics papers that show beautiful assemblies after using Hi-C data. There were a couple of talks at PAG last year that showed Hi-C could greatly improve assemblies that had been made with Illumina data alone, and there are so many Illumina-only genomes lying around, you might have one yourself. However, I haven’t found genome papers where they set out to use only Illumina data with Hi-C, though there probably are some. Originally the fragmentation after cross-linking was done with a restriction enzyme, but now there is Micro-C, a Hi-C approach using micrococcal nuclease (MNase) in place of REs. This is supposed to eliminate many of the artifacts that arise from REs. Here is a paper using Hi-C (here). I’ve found 10M and 100M reads (Illumina) suggested to scaffold a genome. Recall that the rare reads, the ones that link sequences that are far away from each other, are the most informative for scaffolding. Given that a NextSeq run can generate 120Gb/400 million reads (NextSeq), this is a fraction of a run, and can be combined with other projects. Make sure to talk to the sequencing center so the indexing codes don’t conflict, or have them make the libraries. A variation on Hi-C is the Chicago method, which takes naked DNA and in vitro wraps it with nucleosomes; this eliminates some higher order structure that may be in the chromatin and be confusing.
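To see how much of a run the suggested Hi-C read counts take up, the arithmetic can be spelled out. This sketch uses only the numbers quoted above (10M or 100M reads needed, 400 million reads per NextSeq run); the function name is made up, and actual run output depends on the instrument and kit.

```python
def fraction_of_run(reads_needed, reads_per_run=400_000_000):
    """Fraction of one sequencing run consumed by a library.

    reads_per_run defaults to the 400 million figure quoted in the
    text; treat it as an assumption and check with your center.
    """
    return reads_needed / reads_per_run


for reads in (10_000_000, 100_000_000):
    share = fraction_of_run(reads)
    print(f"{reads:,} reads is {share:.1%} of a run")
```

At the low end the Hi-C library is a small slice of a run, while at the high end it takes about a quarter, which is why pooling with other projects is worth asking about.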
Phase Proximo and Dovetail are two companies that offer Hi-C, although it looks as if Phase Proximo focuses on the Chicago method (and metagenomics). Dovetail seems to be more the complete solution to genome assembly, and would be high on my list of places to start.
Bionano labels short motifs that occur approximately every 1kb, then uses electrophoresis to move long, linear molecules, individually, past a detector that records the presence of the labeled sites and the transit time between sites (the distance between sites). This technology used to rely on restriction sites, nicking the DNA and fluorescently labeling it (Irys); however, this has been replaced by non-destructive labeling (Saphyr). The labeling gives a “restriction map” of the molecules read in, which can be hundreds of kb long. The mapping is done multiple times with different motifs and fluorescent labels (in one reaction) and generates a map of common motifs that can then be paired with a long-read method by mapping the long reads to the scaffold created by the motif pattern. Bionano doesn’t have high resolution, as it doesn’t give you individual base information, but it does fine to orient and join the long-read contigs. The resolution of the restriction map is said to be 1kb, and that is sufficient to assemble restriction maps. The current model is the Saphyr (here). “As a result of the updates just announced, Saphyr can collect as much as 5 Tbp per cell of data, or over 1500x coverage of a human genome, in 48 to 96 hours with three samples in parallel, for a total of 15 Tbp on a single Saphyr Chip” (here). Saphyr has the main advantage of being the cheaper scaffolding technology, but with only 1kb resolution, there is no additional sequencing depth added to the assembly as you have with the Illumina-driven Hi-C platforms.
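The way sequence contigs are matched to an optical map can be sketched as an in silico labeling step: find every occurrence of the label motif in a contig and record the spacing between neighboring sites, which is the pattern that gets compared with the measured map. Everything in the sketch is illustrative. The motif is a placeholder, the contig is a made-up string, only the forward strand is scanned, and real alignment software works at kilobase scale with measurement error, none of which is modeled here.

```python
def label_positions(sequence, motif):
    """Return 0-based start positions of every motif occurrence.

    Overlapping occurrences are reported too, since the search
    restarts one base after each hit.
    """
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def label_spacings(positions):
    """Distances between consecutive label sites."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Placeholder motif and toy contig, chosen only for illustration.
MOTIF = "CTTAAG"
contig = MOTIF + "A" * 10 + MOTIF + "C" * 25 + MOTIF

sites = label_positions(contig, MOTIF)
print("label sites:", sites)
print("spacings:", label_spacings(sites))
```

Two contigs that produce spacing patterns matching adjacent stretches of the same optical map can then be ordered and oriented relative to each other, which is how the map does its scaffolding without any base-level information.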