- Just out (March 1st 2021): New nanopore sequencing chemistry in developers’ hands; set to deliver Q20+ (99%+) “raw read” accuracy (modified enzyme, tweaked run conditions and further improved base calling model in the Bonito base caller)
- Nanopore reads are much longer than PacBio reads: they can reach 330 kbp, and some even exceed 2 Mb. Yield/cell is 245 Gb.
- It can be used for both DNA and RNA (without error-prone reverse transcription), and it can read methylated bases and other modifications (although this is a work in progress!)
- The voltage across the pore can be reversed to reject individual DNA molecules that are being sequenced
- Perhaps most important, the long reads resolve complicated/large repetitive regions that fool Illumina assembly, and the long reads allow assembly of haplotypes (these are both true of PacBio as well).
- They also allow resolution of complicated mRNA splicing events.
- [20) Oxford Nanopore Technologies Provides Tech Updates, Dec 02, 2021]
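The "Q20+ (99%+)" figure in the first bullet follows from the standard Phred scale, where a quality score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of that conversion (the function names are my own, not from any base caller):

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred quality score Q."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality score implied by a per-base accuracy (must be < 1)."""
    return -10.0 * math.log10(1.0 - accuracy)


for q in (10, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.1%} accuracy")
```

So Q10 is 90%, Q20 is 99%, and Q30 is 99.9% per-base accuracy.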
What is Nanopore being used for?
- Genomics, both prokaryotic and eukaryotic; transcriptomics, metagenomics…
- Resolving structural variants in disease
- Real-time monitoring of third world disease outbreaks
- It is portable and fast compared to other methods, and comparatively cheap
- Amplicon sequencing (e.g. 16S, diagnostic virus intervals, etc.)
A Nanopore Minion in Antarctica
This paper reviews Nanopore as a microbiome tool: Is Oxford Nanopore sequencing ready for analyzing complex microbiomes? It’s nice in that it considers how quantitative nanopore data can be, and how feasible field operations are.
A brief interlude on why repeats are a problem, and how long reads help/fix this problem:
In this figure you see that both Overlap, Layout, Consensus (OLC) and De Bruijn Graph (DBG) assemblers leave the red repeat unresolved. They can’t determine which pair of flanks go together: green with blue, yellow with orange.
There are two general solutions:
1) Mate Pairs, two reads from the same molecule, but spaced far apart (a “large insert size”, up to ~200kbp), esp. from the Sanger/ABI era
2) Really long reads (nanopore and PacBio)
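To make the repeat problem concrete, here is a toy sketch (the sequences and names are invented for illustration). Two different flank pairings around the same repeat produce exactly the same set of k-mers when k is shorter than the repeat, so a k-mer based assembler cannot tell them apart. Once the "read" length exceeds the repeat, the two pairings become distinguishable.

```python
def kmers(seq, k):
    """All k-mers (substrings of length k) in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_spectrum(molecules, k):
    """Union of the k-mers over several molecules."""
    spectrum = set()
    for molecule in molecules:
        spectrum |= kmers(molecule, k)
    return spectrum


# Hypothetical flanks (named after the figure's colors) and a 12 bp repeat.
GREEN, BLUE = "ACGTTGCA", "TTAGGCTA"
YELLOW, ORANGE = "CCATGGAT", "GATCCGTA"
REPEAT = "TCAGGTCCAATG"

# Two candidate arrangements of the same flanks around the repeat.
pairing_a = [GREEN + REPEAT + BLUE, YELLOW + REPEAT + ORANGE]
pairing_b = [GREEN + REPEAT + ORANGE, YELLOW + REPEAT + BLUE]

short_k = 8   # shorter than the repeat: no k-mer reaches from flank to flank
long_k = 20   # longer than the repeat: some k-mers span flank + repeat + flank

short_reads_ambiguous = kmer_spectrum(pairing_a, short_k) == kmer_spectrum(pairing_b, short_k)
long_reads_ambiguous = kmer_spectrum(pairing_a, long_k) == kmer_spectrum(pairing_b, long_k)
print(short_reads_ambiguous, long_reads_ambiguous)
```

Mate pairs solve the same problem indirectly: the two ends of one molecule land in the two flanks, which links them across the repeat.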
Transcript/mRNA isoforms are also a problem that long reads can solve:
Ok, back to nanopore!
DNA flows through the pore at an average of 450 bases/second and the electrical signal is recorded at 4000 Hz, yielding 9 measurements/base on average.
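The "9 measurements/base" figure is just the ratio of the two rates quoted above. A small sketch of that arithmetic (the 10 kb read is an arbitrary example of mine):

```python
SAMPLE_RATE_HZ = 4000         # current measurements per second
TRANSLOCATION_SPEED = 450     # bases per second through the pore (average)

samples_per_base = SAMPLE_RATE_HZ / TRANSLOCATION_SPEED


def expected_samples(read_length_bases):
    """Approximate number of raw current samples for a read of this length."""
    return round(read_length_bases * samples_per_base)


print(f"{samples_per_base:.1f} samples per base on average")
print(f"a 10 kb read gives roughly {expected_samples(10_000)} raw samples")
```

Because the speed is only an average, the number of samples per base varies along a read, which is part of why base calling is not a simple lookup.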
Let’s look a little more closely at the process
One flow cell, 512 pores
Get up to 50 Gb data from a single flow cell (72 hours at 420 bases/second).
Simple 10 min sample prep
5 flow cells
As much as 250 Gb of data
up to 48 independently addressable PromethION Flow Cells
3000 pores per cell
Up to 14 Tb data
[SmidgION: a mobile phone sequencer announced in May 2016, currently in development]
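The "up to" yields above can be sanity-checked with a back-of-the-envelope ceiling: pores × bases/second × run time. The sketch below assumes every pore sequences continuously for the whole run, and it applies the 72 hours and 420 bases/second quoted for the single flow cell to the larger devices as well, which is my assumption.

```python
SECONDS_PER_HOUR = 3600


def max_yield_gb(pores, bases_per_second, hours, flow_cells=1):
    """Theoretical ceiling in gigabases if every pore is busy for the whole run."""
    bases = pores * flow_cells * bases_per_second * hours * SECONDS_PER_HOUR
    return bases / 1e9


one_cell = max_yield_gb(pores=512, bases_per_second=420, hours=72)
five_cells = max_yield_gb(pores=512, bases_per_second=420, hours=72, flow_cells=5)
forty_eight_large_cells = max_yield_gb(pores=3000, bases_per_second=420, hours=72, flow_cells=48)

print(f"1 flow cell, 512 pores:     {one_cell:,.1f} Gb")
print(f"5 flow cells, 512 pores:    {five_cells:,.1f} Gb")
print(f"48 flow cells, 3000 pores:  {forty_eight_large_cells / 1000:,.1f} Tb")
```

The ceilings come out at about 56 Gb, 279 Gb, and 15.7 Tb, a little above the quoted 50 Gb, 250 Gb, and 14 Tb. That is the direction you would expect, since real pores spend some time empty or blocked.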
The “squiggle plot” is produced as a fast5 file. If you have the fast5 file, you can use alternative base callers to produce the FASTQ file. That’s important because the squiggle signal is actually pretty messy, unlike the above cartoon:
Zooming in a little more, you can see that there are “steps” in the current:
Zooming in a little more, we get down to where you can almost see that there is data there with which to call bases.
Base calling has improved a lot, and with it accuracy. With Bonito, accuracy is >99%. The evolution of base callers:
Here are three nice papers comparing different base callers, and their error profiles
- High quality genome assemblies of Mycoplasma bovis using a taxon-specific Bonito basecaller for MinION and Flongle long-read nanopore sequencing
- Performance of neural network basecalling tools for Oxford Nanopore sequencing
- Causalcall: Nanopore Basecalling Using a Temporal Convolutional Network
Something I don’t understand yet: The cartoon up above of how Nanopore works, and many other places, suggest that as each single nucleotide passes the pore there is a change in current that is “read” as a particular base. However, reading into it a little, it seems that multiple nucleotides are blocking the current at any one time: 5 for the R9.4 pore (I don’t know about the new R10 pore). That is, for the R9.4 pore, the current records a sliding window of 5-mers. Taking a quote from an above paper: “Furthermore, the electrical resistance of a pore is determined by the bases present within multiple nucleotides that reside in the pore’s narrowest point (approximately five nucleotides for the R9.4 pore), yielding a large number of possible states: 4^5 = 1024 for a standard four-base model. When modified bases are present, e.g. 5-methylcytosine, the number of possible states can grow even higher: 5^5 = 3125. This makes basecalling of ONT device signals a challenging machine learning problem and a key factor determining the quality and usability of ONT sequencing.” It’s clear that this is what makes Nanopore base calling really hard! At least in part…
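The state counts in the quote are simply (alphabet size)^(window length). The sketch below enumerates them and shows the sliding window of 5-mers that a short sequence presents to the pore. The symbol "M" for 5-methylcytosine is a placeholder of my own, not a standard code.

```python
from itertools import product


def pore_states(alphabet, k=5):
    """Every k-mer the pore constriction could hold for a given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def sliding_kmers(seq, k=5):
    """Successive k-mers occupying the pore as the strand moves one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


standard_states = pore_states("ACGT")    # four-base model
modified_states = pore_states("ACGTM")   # "M" = 5-methylcytosine (placeholder symbol)

print(len(standard_states), len(modified_states))
print(sliding_kmers("ACGTACGT"))
```

Each base sits in five consecutive windows, so every current level depends on a base's neighbors as well as the base itself. A base caller therefore has to decode overlapping k-mer states rather than read one base per step.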
Among other things that are great about Nanopore (and PacBio) is the direct detection of epigenetic marks (e.g. base methylation, etc.).
Since with Nanopore you use native DNA or RNA (i.e. no amplification), you will always have modified nucleotides, and they need to be accounted for—it’s not clear to me how this works (same is true for PacBio). Here is a recent paper that compares some of the methods for CpG methylation detection.
Nanopore detection of modified bases
Three very recent papers:
- Discovering multiple types of DNA methylation from bacteria and microbiome using nanopore sequencing
- DNA methylation patterns differ between free-living Rhizobium leguminosarum RCAM1026 and bacteroids formed in symbiosis with pea (Pisum sativum L.)
- DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation
Some now older works:
- Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing
- Megabase-scale methylation phasing using nanopore long reads and NanoMethPhase
- (PacBio: Genome-wide detection of cytosine methylation by single molecule real-time sequencing)
- Older: Detecting DNA cytosine methylation using nanopore sequencing
Nanopore has various ways to sequence both strands of a dsDNA molecule and then take the consensus of the two reads. The current one is “1D2”. The second strand is anchored near the pore, and so is likely to be the next strand sequenced (rather than all the other strands in solution). A previous protocol added a hairpin to the free end of the dsDNA, which seems more elegant to me, but doesn’t seem to be the current method. I could be wrong on this.
How long is a sequencing run? This is fun: however long you want it.
- Indefinite! [The buffer runs out eventually.]
- While a sequencer like Illumina (or PacBio) goes through cycles of adding a nucleotide, washing, adding the next nucleotide, etc., and is programmed to do this a certain number of times (e.g. Illumina 2×150 sequencing), Nanopore just reels the strand through the pore, and can be left on indefinitely. An example they give (dependent on real-time base calling) is to go until you reach 10x coverage of a target sequence. You also start to get sequence immediately!
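The "run until 10x coverage" idea can be sketched as a simple stopping rule. This is an illustration only, not how the instrument software does it. It assumes every read maps to the target and measures coverage as total bases divided by target length. The 50 kb target and 10 kb reads are made-up numbers.

```python
def reads_until_coverage(read_lengths, target_length, target_coverage=10.0):
    """Consume read lengths as they arrive and stop once mean depth hits the goal.

    Returns (number_of_reads, coverage), or None if the reads run out first.
    """
    total_bases = 0
    for n_reads, length in enumerate(read_lengths, start=1):
        total_bases += length
        coverage = total_bases / target_length
        if coverage >= target_coverage:
            return n_reads, coverage
    return None


# Hypothetical run: a 50 kb target and a stream of 10 kb reads.
stream = (10_000 for _ in range(1_000))
result = reads_until_coverage(stream, target_length=50_000)
print(result)
```

Because the reads are consumed from a generator, the loop stops as soon as the goal is met, mirroring how a run could be ended early once real-time base calling reports enough coverage.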
Note that real-time base calling is computationally demanding, so in the field with a laptop you may only be able to store the raw data (the fast5), although they have a cloud platform that will do the calling if you have internet. I’m a little confused about what can be done in the field and what can’t be done without a bigger computer.
Any sequencing method will have some bias, but Nanopore seems to have relatively little.
There is a recent paper that looked at GC bias, which is a major problem with many methods. They examined bias for various Illumina platforms, PacBio, and Nanopore.
Gigascience, Volume 9, Issue 2, February 2020, giaa008, https://doi.org/10.1093/gigascience/giaa008
Figure 1: Coverage biases in the sequencing of Fusobacterium sp. C1. The circle plot shows from the inside: GC content …
Look at the blow-up: the black squiggle on the left of the blow-up is GC content. It shows that the genome is GC poor, except for the rRNA loci, which are constrained for structure and far more GC rich. MiSeq and NextSeq have a sharp coverage peak over the rRNA locus. HiSeq and PacBio have a peak, but much broader, and Nanopore wins the day, with pretty even coverage!
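A GC-content track like the one in the figure is usually computed in windows along the genome. Here is a minimal sketch on an invented sequence: an AT-rich background with one GC-rich island, loosely mimicking a GC-poor genome with a GC-rich rRNA locus.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step):
    """List of (start, GC fraction) for windows of the given size and step."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy "genome": AT-rich flanks around a GC-rich island (invented sequence).
toy_genome = "AT" * 20 + "GC" * 10 + "AT" * 20

for start, gc in windowed_gc(toy_genome, window=20, step=20):
    print(start, gc)
```

Plotting read depth per window against these GC fractions is one simple way to look for the kind of coverage bias the paper describes.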
Adaptive sampling for selective nanopore sequencing: “read until”
- This site has a really cool movie, that you really need to watch.
- But the general idea is that as the sequence from each pore is read in real time, it can be checked against a known target sequence. If the sequence in a pore matches the target, sequencing continues; if it doesn’t match, the voltage is reversed and the DNA strand is ejected from the pore, allowing a new DNA molecule to dock and start being sequenced.
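The decision logic can be sketched as follows. This is a toy stand-in, not ONT's read-until API. Real tools map the first part of each read against a reference with a proper aligner, whereas this sketch only counts exact shared k-mers. The class, function names, threshold, and sequences are all hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in a sequence (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class TargetIndex:
    """k-mer index of a target sequence, covering both strands."""

    def __init__(self, target, k=8):
        self.k = k
        self.kmers = kmers(target, k) | kmers(reverse_complement(target), k)

    def matches(self, read_prefix, min_fraction=0.5):
        """True if enough of the read's k-mers are found in the target."""
        read_kmers = kmers(read_prefix, self.k)
        if not read_kmers:
            return False
        shared = len(read_kmers & self.kmers)
        return shared / len(read_kmers) >= min_fraction


def decide(read_prefix, index):
    """'continue' sequencing an on-target strand, otherwise 'eject' it."""
    return "continue" if index.matches(read_prefix) else "eject"


target = "ACGTTGCATCAGGTCCAATGGATCCGTATTAGGCTACCATGGATAGCTAGCTAGGATCCA"
index = TargetIndex(target)
on_target = target[10:40]
off_target = "T" * 30

print(decide(on_target, index))
print(decide(reverse_complement(on_target), index))
print(decide(off_target, index))
```

In a real run the decision has to arrive while the strand is still in the pore, which is why this approach depends on fast real-time base calling.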
Nanopore of the future
- Nanopore methodology is changing quickly.
- Base calling with Bonito is improved (it uses a recurrent neural network (RNN))
- A recently published base caller uses a “temporal convolutional network (TCN) and a connectionist temporal classification decoder” which is supposed to improve base calling
- There are regular improvements in the pore itself. The most recent seems to be a pore with two constrictions, so each nucleotide is “read” twice. This is supposed to improve homopolymer calling. It will change the nature of the raw data and necessitate new base callers (as I understand it).
- “Recently, ONT released Nanopore R10 with a predicted model accuracy of 94%, and introduced the newest version R10.3 with of 99.995% single molecule consensus accuracy, which has a longer barrel and a “dual-reader head” inside the pore”