NCGAS is teaching a course in genomics, officially FA21: SPECIAL TOPICS IN ZOOLOGY: 46585 … rolls right off the tongue, but might be better called GEMS: WORKFLOWS IN GENOMICS. GEMS is Genomics and eco-evolution of multi-scale symbioses. The concept is that we start with a data set from the literature that includes Illumina and Nanopore data for bacterial species and walk students through 0) getting comfortable on the command line, 1) obtaining the data from the SRA, 2) doing quality control on the sequence, 3) assembling the Illumina, the Nanopore, and the combined Illumina/Nanopore reads, 4) evaluating the assemblies, and 5) maybe doing some annotation, depending on time constraints. The course only meets once a week for 2 hours.
Included in all this are short introductions to Illumina and Nanopore sequencing. I thought I would give these to you as blogs, this one on Illumina and another to come on Nanopore, and eventually one on PacBio. Note that we recently posted a blog on next generation sequencing, i.e. long read sequencing methods (and I included some scaffolding methods). That has a different flavor from what I’m doing here. And although there must be a zillion intros to Illumina, this is mine. So let’s get started!
- Illumina was/is the most robust of the 2nd gen. chemistries
- Has a low error rate, and its error profile is well understood
- It’s cheap, and many many centers have machines
- There are many commercial kits for library prep (as well as DIY protocols)
- Its relatively short reads are fine for differential expression (and can be used for de novo assembly of the transcriptome)
- However, although many many genome assemblies have been done with Illumina-only data over the last several years, these are at best fragmented “draft” assemblies.
Basic Illumina library prep and sequencing:
In the above graphic, Step 1 is shearing the DNA, usually by a method that selects for a general size, often followed by a size selection. Step 2 is adding adaptors via ligation (we’ll look at the adaptors below), often followed by a PCR step to give more material. In Step 3, sequence in the adaptors hybridizes to the matrix of the flow cell. Steps 4 and 5 amplify each single bound strand to give clusters of sequence that will all be sequenced in tandem. The clusters give a signal that is brighter than a single strand could be. [PacBio uses a method that allows the sequence coming from a single strand to be followed.]
Each nucleotide (G, A, T and C) has a unique fluorescent ligand attached. I’m not actually sure what the biotinylated U is used for. Each nucleotide also has a 3′ block, so it can’t be extended from until the block is removed. These special nucleotides are a large part, I think, of why Illumina is a little expensive (the same applies to PacBio).
This shows the sequencing reaction “close up”. All four nucleotides are added, but only one will pair to the strands in a cluster. The other three are washed out and the clusters are imaged. Then the fluor on the newly polymerized nucleotide is removed, as well as the 3′ block. The replicating strand can now be extended again as the cocktail of four nucleotides is added.
What is the actual data? Spots!
An image of a flow cell and a close up. You will never see the TIFFs: Illumina’s in-machine software uses cluster intensities, noise estimates, etc. to output the sequence of bases read from each cluster, along with a confidence level for each base. Note that each spot has a “color”.
Coming back to the adaptors, this is what they look like.
Figure 1 | Structure of a sequencing-ready Illumina-compatible library. The insert sequence (gray) is flanked by two sequencing adapters. The P5 adapter contains a flow cell binding region (black). This sequence can also coincide with the binding site for the Index 2 sequencing primer for the optional i5 index (Index 2, purple). * Depending on the sequencer the index 2 sequencing primer binding site can be located in the inner or outer region of the adapter. The P5 adapter also contains the Read 1 sequencing primer binding site (green). The P7 adapter contains a flow cell binding region (orange), the i7 index sequence (Index 1, yellow) and the Read 2 / Index 1 sequencing primer binding sites (blue).
Note that the two indexes are independent reads (you might only use one of them). Then there are forward and reverse priming sites to sequence across your insert. You might only sequence from one end, depending on your needs (RNAseq for differential expression really only needs a read from one side, to map to the reference genome or transcriptome). So at most there are four rounds: sequencing from a given primer, washing that primer off, annealing the next primer, and so on.
Firecrest is the module used for image analysis. Firecrest identifies cluster positions and extracts intensities. Through image filtering, it sharpens and enhances clusters, removes background noise, and detects clusters based on morphological features on the image. Firecrest also adjusts the scale and registration of an image. Firecrest is currently performed in real time with the sequencing process on a dedicated IPAR server as part of the 1.0 pipeline.
Bustard is the module used for base calling. Bustard deconvolves the signal from the clusters and applies correction for cross-talk, phasing, and pre-phasing.
- Frequency cross-talk—The Genome Analyzer uses two lasers and four filters to detect four dyes attached to the four types of nucleotide, respectively. The frequency emission of these four dyes overlaps so that the four images are not independent. The frequency cross-talk is deconvolved using a frequency cross-talk matrix.
- Phasing/Pre-phasing—Depending on the efficiency of the fluidics and the sequencing reactions, a small number of molecules in each cluster may run ahead (pre-phasing) or fall behind (phasing) of the current incorporation cycle. This effect is mitigated by applying corrections during the base calling step.
- All of these corrections are based on an assumption of equal base frequency. For this reason, Illumina recommends the inclusion of the PhiX control sample in all runs (7+1). Work is currently under way at the [Broad Institute] to create defined spiked-in DNA templates with equal base frequency that can act as control reads; however, this has not yet been incorporated into the pipeline.
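To make the cross-talk correction a little more concrete, here is a minimal sketch of the idea, not Bustard's actual code. It assumes the observed channel intensities are a linear mix of the true dye signals, so the correction amounts to solving a small linear system. The matrix values, the function names and the example cluster are all made up for illustration.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Use the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


BASES = "ACGT"

# HYPOTHETICAL cross-talk matrix: entry [i][j] is the fraction of dye j's
# signal that shows up in channel i. These numbers are invented; a real
# pipeline estimates the matrix from the run itself.
CROSSTALK = [
    [1.00, 0.60, 0.02, 0.01],
    [0.30, 1.00, 0.02, 0.01],
    [0.01, 0.02, 1.00, 0.50],
    [0.01, 0.02, 0.25, 1.00],
]


def call_base(observed):
    """Deconvolve four channel intensities; return (base, corrected signals)."""
    corrected = solve_linear(CROSSTALK, observed)
    best = max(range(4), key=lambda i: corrected[i])
    return BASES[best], corrected


# A cluster emitting only the "C" dye at intensity 100 is observed as
# 100 times the C column of the matrix, i.e. smeared across channels.
observed = [60.0, 100.0, 2.0, 2.0]
base, corrected = call_base(observed)
print(base, [round(v, 1) for v in corrected])
```

Before correction the A channel looks fairly bright (60), but after deconvolution essentially all of the signal lands on C. Phasing and pre-phasing corrections are a separate step and are not shown here.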
Each base call, so each extension step for each cluster, is given a quality score: how certain the software is that the assigned base is correct. This is the Q score (also known as a Phred quality score).
- Quality scores (Q) range from 4 to about 60, with higher values corresponding to higher quality. The quality scores are logarithmically linked to error probabilities:
| Quality score | Probability of a wrong base call | Accuracy of a base call |
|---|---|---|
| Q10 | 1 in 10 | 90% |
| Q20 | 1 in 100 | 99% |
| Q30 | 1 in 1,000 | 99.9% |
| Q40 | 1 in 10,000 | 99.99% |
| Q50 | 1 in 100,000 | 99.999% |
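The logarithmic link is the standard Phred relationship, Q = -10 × log10(P), where P is the probability that the base call is wrong. As a quick sketch (the function names are mine, not from any Illumina tool), you can convert in both directions like this:

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error probability and accuracy for a few common Q scores.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):.3f}%")
```

Every 10 points of Q is another factor of 10 in accuracy, which is why Q30 is such a common threshold.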
Now, if you have been downloading sequences from GenBank for a while, you are probably familiar with the fasta format, which has a single name line starting with >, then a return and the sequence:
>MCHU – Calmodulin – Human, rabbit, bovine, rat, and chicken
But what a sequencer gives you is a fastq file, which still has the quality score for each nucleotide:
In a fastq there are four lines per read, with returns between them. The label line starts with an @. A “+” separates the sequence line from the quality scores, one for each base, each encoded as a single character. There is actually a lot more to be said about how the quality is encoded, and I believe Layla is going to address that in a separate blog. But for now, know that programs like assemblers use the quality scores as they align sequences. You will probably trim your reads before using them, to remove sequence that is of poor quality (using FastQC to visualize the quality, and Trimmomatic to remove it from each read).
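If it helps to see the four-line structure in code, here is a small sketch of a reader. It assumes strictly four lines per record and the Phred+33 character encoding used by current Illumina output; older files can use a different offset. The read name, sequence and quality string are invented for the example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record fastq.

    Stops at end of file or at the first empty line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def decode_quality(qual, offset=33):
    """Turn a quality string into a list of integer Q scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


# A made-up record: eight good bases followed by two poor ones.
EXAMPLE = "@read1 hypothetical\nGATTACAGAT\n+\nIIIIIIII5#\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
name, seq, qual = records[0]
scores = decode_quality(qual)
print(name, seq, scores)
```

In practice you would open a real file (or use an existing parser) instead of the in-memory string, but the record layout is the same.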
Let’s come back to Illumina. There are a number of considerations pertaining to the libraries you will be making from your genomic DNA, cDNA, amplicon etc.
- Indexes: single or dual. How much sequence do you need per library, and how many libraries will you pool to put into a lane?
- You can ride on someone else’s run, if the indexes are compatible; Fractions of a lane: talk to your center
- Skim sequencing as quality control
- Balancing pooled libraries. You need to quantify each library, then pool them so there are equimolar amounts of each
- Have the center make the libraries, if possible
- Need to get the indexes right—just being different isn’t good enough
- I can never keep this straight and always consult with the center
- If you are doing—say—population genomics, with ≥96 samples, libraries become a significant cost.
- A library gives many “runs worth” of sequence, so you can always go back to a library and seq. more.
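On the balancing point above, the arithmetic is simple enough to sketch. This assumes you have a concentration in ng/µL and a mean fragment length (insert plus adaptors) for each library, and uses the usual approximation of 660 g/mol per base pair of double-stranded DNA. The library names and numbers are invented, and your center will have its own preferred procedure.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Approximate molarity in nM from ng/µL and mean fragment length in bp.

    Uses 660 g/mol as the average mass of one base pair of dsDNA.
    """
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def equimolar_volumes(libraries, fmol_per_library):
    """Volume (µL) of each library needed to contribute the same number of fmol.

    `libraries` maps a name to (concentration in ng/µL, mean fragment length in bp).
    """
    volumes = {}
    for name, (conc, length) in libraries.items():
        nm = library_molarity_nm(conc, length)  # 1 nM is 1 fmol per µL
        volumes[name] = fmol_per_library / nm
    return volumes


# Hypothetical libraries: (ng/µL, mean fragment length in bp).
libraries = {
    "libA": (2.0, 400),
    "libB": (4.0, 400),
    "libC": (2.0, 800),
}

volumes = equimolar_volumes(libraries, fmol_per_library=20.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} µL")
```

Note that libA and libC have the same ng/µL, but libC needs twice the volume because its fragments are twice as long, so there are half as many molecules per nanogram.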
Almost done! Let’s talk a little about bias in Illumina sequence. Almost all sequencing methods have some. For Illumina, some of it comes from the PCR amplification that happens after the adaptors are attached and before the library is attached to the flow cell.
- The GC content of the genome is an important one (see the figure below)
- The library amplification (PCR) step
- The context of each nucleotide being sequenced
- The first 10 nucleotides of sequence are of poor quality (not bias, but still)
- The second read is not as good as the forward read (again, not bias, but I had to put it someplace)
I’m going to illustrate this with one figure from a recent paper that specifically looked at the GC bias of most of the major sequencing platforms.
Figure 1: Coverage biases in the sequencing of Fusobacterium sp. C1 [an AT rich species]. The circle plot shows from the inside: GC content …
Look at the blow-up: the black squiggle on the left of the blow-up is GC content. It shows that the genome is GC poor, except for the rRNA loci, which are constrained for structure and far more GC rich. MiSeq and NextSeq have a sharp coverage peak over the rRNA locus. HiSeq and PacBio have a peak, but much broader, and the Nanopore wins the day, with pretty even coverage!
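If you want to draw your own version of that GC squiggle, the calculation is just GC fraction in windows along the genome. Here is a minimal sketch; the window size, the function names and the toy sequence are my own choices, not what the paper used.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) in seq that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, window, step):
    """Return (start, GC fraction) for each full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy "genome": AT-rich flanks around a short GC-rich stretch, loosely
# mimicking an AT-rich genome with a GC-rich rRNA locus.
toy = "AT" * 20 + "GC" * 10 + "AT" * 20

profile = windowed_gc(toy, window=20, step=20)
for start, gc in profile:
    print(start, gc)
```

Plot the second column against the first, put your per-window read coverage next to it, and you can look for the same kind of coverage peak over GC-rich regions in your own data.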
That’s that. I hope this is helpful, and by all means ask me to add content where you feel it’s missing.