This blog is brought to you by NCGAS undergraduate intern Christine Campbell and is inspired by the curiosity of former NCGAS Co-PI Craig Stewart (who is now enjoying well-earned retirement). This is part of a multi-part series on exploring questions about epigenetics in beetles. Follow along as we determine which species have methylation enzymes, how these patterns are organized on the species tree of beetles, and how you can explore possible patterns of methylation – entirely with bioinformatics!
The Presence and Absence of Methylation in Different Taxa
This is a beginner’s exercise that will introduce you to some of the uses of the NCBI portal and a phylogenetics site. It is also one of the first things you will do when you become interested in a particular gene/protein family.
Social taxa such as bees, wasps, and termites have been observed to have higher levels of CpG methylation in their genomes. We are interested in Coleoptera (beetles) to see whether this relationship between sociality and methylation holds there as well.
There are a few questions we can ask about methylation in Coleoptera:
- Are the enzymes associated with CpG methylation present in one or more Coleoptera genomes? Coleoptera taxa range from asocial to semisocial, so we might predict that the enzymes responsible for CpG methylation will vary in presence/absence with sociality, or show different patterns of evolution (e.g. faster evolution in lineages that have gained sociality).
- Can we look at patterns of CpG depletion in genome sequences to determine if there is likely methylation going on within different species? (see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5915941/, and a future blog highlighting code you can use to explore this in your genome!)
- Are there actual methylated C’s in the genome? (this takes real biochemistry, which we will discuss another time)
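As a rough illustration of the second question above: methylated cytosines tend to mutate to thymine over evolutionary time, so genomes with CpG methylation often contain fewer CpG dinucleotides than their C and G content would predict. The sketch below computes one commonly used form of the CpG observed/expected ratio. The function name `cpg_obs_exp` is our own, and this is a minimal example rather than the code from the paper linked above.

```python
def cpg_obs_exp(seq: str) -> float:
    """Return a CpG observed/expected ratio for a DNA sequence.

    Uses the form (CpG count * length) / (C count * G count).
    Values well below 1 suggest CpG depletion.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides read 5' to 3'
    if c_count == 0 or g_count == 0:
        # No C or no G means no CpG is possible; report 0 rather than divide by zero.
        return 0.0
    return (cpg_count * len(seq)) / (c_count * g_count)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real beetle data.
    for toy in ("CCGG", "CGCGCGCG", "CCTTGGAA"):
        print(toy, cpg_obs_exp(toy))
```

In practice you would run this over gene bodies or sliding windows of a genome assembly rather than a handful of bases.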
This blog will explore different taxa of Coleoptera to determine whether they have the two enzymes that are associated with CpG methylation.
DNMT1 and DNMT3 (DNA methyltransferases 1 and 3) are involved in CpG methylation: DNMT1 is the maintenance methylase, while DNMT3 is the de novo methylase.
DNMT2 is a homolog of DNMT1 and DNMT3 but, unlike them, it is used for tRNA methylation. We included it as a control because it is a homolog of the other two methylases.
Our pipeline is:
- Find Coleoptera for which there are whole genome sequences.
- Search them for DNMT1, DNMT2, and DNMT3 using NCBI’s BLAST program.
- Construct a presence/absence table for each gene in each taxon.
- Construct phylogenetic trees of the three ortholog families and see if there are patterns in the presence/absence of the methylation enzymes, i.e. closely related taxa that share the same pattern of presence/absence.
- From published literature, determine which taxa are thought to be social, see if they all have the homologs, and map sociality onto the trees as a character to see if closely related taxa share sociality. Then correlate this with the presence/absence of the three gene families.
How to use NCBI’s BLAST to find each gene in Coleoptera:
- Search for DNMT1 using NCBI’s “All Databases” filter option at https://www.ncbi.nlm.nih.gov/ to find a representative member of the protein family.
- The result that pops up will be from Danio rerio (zebrafish).
- Click on “Proteins”
- This will bring you to a list of matches; choose the one with the longest match.
- You can either click “FASTA” underneath the link OR click on the link and it will bring you here:
- Click “FASTA” underneath the sequence number.
- This will bring you to the protein sequence of DNMT1 in zebrafish.
- Copy the entire protein sequence with the accession number in front.
- After you copy the protein sequence with the accession number, click the NCBI logo in the top left corner.
- This will take you to the home page; click BLAST on the right side of the screen.
- This will give you the option to BLAST either a nucleotide or a protein sequence. We want protein, since a protein sequence is what we pulled for DNMT1 from the zebrafish, and protein sequences are more informative for finding homologs (can you guess why?).
- Select Protein BLAST.
- This is what you should see in front of you at this point:
- Paste the protein sequence with the accession number into the box provided. To ensure that we get a list of Coleoptera instead of every taxon in the database, type coleoptera into the “Organism” filter box. Then click “BLAST” at the bottom of the screen.
- A loading screen will appear that looks like this:
- Take a break, grab a drink or a snack because this may take a few minutes.
- Once it is done loading your screen will look like this:
Yay! You found DNMT1 in Coleoptera! However, we’re not done yet. We’ll want to grab the methyltransferase with the highest Query Cover.
- Click the link on the left that has the highest Query Cover.
- It should take you to a page that looks like this. Click the Sequence ID.
- It will open a new tab and show you this window. Click the FASTA link near the top of the page.
- It will take you to the protein sequence of the taxon we clicked on, with the accession number in front. Copy this entire sequence, including the accession number.
- Once you have copied the sequence with the accession number, click the NCBI logo in the top left corner.
- This will take you back to the home page. Click on BLAST along the right side again.
- Choose the protein BLAST on the right side. Paste the sequence into the FASTA sequence input box. This time we are going to BLAST our sequence without filtering the searches for coleoptera since we are already using a coleoptera protein sequence.
- Make sure to change the job title to something that makes most sense to you. Leave the Organism filter option blank and click “BLAST” at the bottom.
- You will see the loading screen another time; get another drink, have a dance party and your results should come in in no time!
- Look at all those beautiful Query Cover and quality percentages! This is what it looks like to find good hits for homologs of the Coleoptera protein sequence we found above. Click on the “Taxonomy” button to see a list of all of the taxa we found.
- You will see this:
- Click the “Organism” tab.
- You will see a list of taxa on the right, with their corresponding accession number listed along the right side. It should look like this:
- This next part can be a little tedious, but it is the easiest way to compile all the FASTA sequences and turn them into a phylogenetic tree. Click on the accession numbers one by one. Each will take you to a page you have seen before. Click the “FASTA” link near the top and copy the protein sequence over to a Word document. Do the same thing for each of the Coleoptera taxa in the list, copying and pasting the sequences with their accession numbers into the same Word document. Make sure to save it in a place you will remember and name the file after the gene we are working with (e.g. DNMT1_Homologs_FASTA).
- Once you have done that with each of the Coleoptera taxa in the list, you can copy the entire Word document and head over to http://www.phylogeny.fr/simple_phylogeny.cgi. Note: NCBI will give you trees of your hits as well, but let’s use a dedicated phylogenetics site instead.
- You will see a web page that looks like this. Paste the entire Word document you made, with all the Coleoptera FASTA sequences and their accession numbers, into the input box provided. Make sure you name the job (e.g. DNMT1_Homologs_Coleoptera). You can enter your email address and the site will notify you when the tree has been generated, because it sometimes takes a while.
- Congratulations! You generated your first phylogenetic tree! It should look similar to this:
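If clicking through every accession by hand gets old, the FASTA-collecting step above can also be scripted. Here is a minimal sketch that builds a request to NCBI’s E-utilities `efetch` service for a batch of protein accessions. The helper names and the placeholder accessions are made up for illustration; swap in the accessions from your own Taxonomy report. NCBI asks users without an API key to stay under about three requests per second, so fetch in batches rather than in a tight loop.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions):
    """Build an efetch URL that returns protein records as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),  # several accessions in one request
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the FASTA records (needs an internet connection)."""
    with urlopen(build_efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder accessions, NOT real records: replace with your own hits.
    my_hits = ["ACCESSION_1", "ACCESSION_2"]
    print(build_efetch_url(my_hits))
    # fasta_text = fetch_fasta(my_hits)  # uncomment once you have real accessions
```

Saving the returned text to a plain-text file gives you the same content as the document you would otherwise build by copy and paste.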
Things you will want to watch out for:
- Low-quality hits.
Hits with an e-value under 1e-10 are considered very high-quality hits. Anything above that is considered low-quality (the cutoff can vary depending on what you are doing). Low-quality hits also tend to have low Query Cover percentages, which you can compare with those of the hits we found earlier in Coleoptera.
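If you download your BLAST results as a tab-separated hit table, the e-value cutoff described above can be applied in a few lines. This sketch assumes the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name and the sample rows are invented for illustration.

```python
def filter_hits(tabular_text, max_evalue=1e-10):
    """Return (subject_id, evalue) pairs for hits with e-value below the cutoff.

    Assumes 12 tab-separated columns with the e-value in column 11.
    """
    kept = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a hit row in the assumed layout
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[1], evalue))
    return kept


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    sample = "\n".join([
        "# Fields: query, subject, % identity, ...",
        "query1\thitA\t62.5\t900\t300\t10\t1\t900\t5\t910\t1e-50\t850",
        "query1\thitB\t25.0\t80\t55\t3\t100\t180\t40\t120\t0.001\t35.4",
        "query1\thitC\t80.1\t1200\t200\t4\t1\t1200\t1\t1195\t0.0\t2100",
    ])
    for subject, evalue in filter_hits(sample):
        print(subject, evalue)
```

The cutoff is a parameter because, as noted above, the right threshold depends on what you are doing.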
When you pull your FASTA sequences into a document to prepare them for the tree-building website, take a look at the name of each sequence to ensure that you are not pulling multiple copies of the same sequence. This keeps your phylogenetic tree as clean and easy to read as possible.
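Checking for duplicates by eye gets harder as the list grows, so here is a small sketch that does the check for you. It treats a record as a duplicate if either its accession (the first word of the header line) or its sequence has already been seen. That rule is our own assumption about what counts as “the same sequence”, and the function names and toy records are ours too.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_duplicates(records):
    """Keep the first record for each accession and each distinct sequence."""
    seen_ids, seen_seqs = set(), set()
    unique = []
    for header, seq in records:
        words = header.split()
        accession = words[0] if words else ""
        key = seq.upper()
        if accession in seen_ids or key in seen_seqs:
            continue
        seen_ids.add(accession)
        seen_seqs.add(key)
        unique.append((header, seq))
    return unique


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = ">acc1 beetle A\nMKT\nLLV\n>acc2 beetle B\nMKTLLV\n>acc3 beetle C\nMAAA\n"
    for header, seq in drop_duplicates(parse_fasta(toy)):
        print(">" + header)
        print(seq)
```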
Creating a spreadsheet to track whether each taxon has a homolog of each gene:
Generate an Excel spreadsheet with each taxon listed along the y-axis and each gene listed along the x-axis. To check whether a taxon has a homolog of a gene, paste the gene’s protein sequence into the BLAST query box and type the specific taxon into the Organism filter box. You will have to look closely at the genes listed in the BLAST results to make sure they match the gene you searched with. Normally, if a hit is a homolog, it will have high scores; if it is not, the scores will be low and the gene you searched for will not be listed. Don’t forget to include your outgroup (in our case it is listed at the bottom: Apis mellifera)!
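If you would rather build the table in code than in Excel, a sketch like the following writes the same layout as a CSV file that Excel can open. The names here are our own, and so is the rule that any taxon/gene cell you have not recorded is filled with “Not Listed”. The two example taxa and their values are taken from our own table.

```python
import csv
import io

GENES = ["DNMT1", "DNMT2", "DNMT3"]


def presence_absence_rows(results, taxa):
    """Build one row per taxon: [taxon, status for each gene in GENES].

    `results` maps taxon -> {gene: status}. Missing cells become "Not Listed".
    """
    rows = []
    for taxon in taxa:
        per_gene = results.get(taxon, {})
        rows.append([taxon] + [per_gene.get(gene, "Not Listed") for gene in GENES])
    return rows


def to_csv(rows):
    """Return the table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Taxon"] + GENES)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    results = {
        "Diabrotica virgifera virgifera": {"DNMT1": "Y", "DNMT2": "Y", "DNMT3": "Y"},
    }
    taxa = ["Chrysina resplendens", "Diabrotica virgifera virgifera"]
    print(to_csv(presence_absence_rows(results, taxa)))
```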
Below is an example of what this looks like:
|Taxon|DNMT1|DNMT2|DNMT3|
|---|---|---|---|
|Chrysina resplendens|Not Listed|Not Listed|Not Listed|
|Diabrotica virgifera virgifera|Y|Y|Y|
|Marronus borbonicus|Not Listed|Not Listed|Not Listed|
This is a bit hard to interpret as it is… so we’ll show you how to map these traits to a species tree in a future post! Stay tuned!