We often talk about transcriptome analysis and R visualization here at NCGAS (they are our most popular workshops!), but there are other ways to visualize data outside of R. One extremely useful option is KEGG pathways.
Step 1: Submit your amino acids
Step 2: Wait
Step 3: Retrieve your results
Step 4: Immediate visualization
Step 5: Visualization later
Step 6: Advanced visualization
Step 7: Alternative inputs
Bonus: Metagenomic Analysis
There are many ways to do KEGG pathways, so let’s start with an easy one that is highly compatible with the pipeline. All you need is a set of amino acids in a fasta file!
Step 1: Submit your amino acids
For this demo, let’s start with all the amino acids from the output of our workshop demo data (available here). This first step is very easy:
- go to the website (link: https://www.kegg.jp/ghostkoala/)
- fill out the form – note: you can only have one submission per email at a time!

- click submit button
- check your email – click the submit link
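Before uploading, it can be worth a quick sanity check of the fasta file. The sketch below is only an illustration: toy_proteins.fasta and its sequences are made up here to stand in for your real amino acid file, and the check simply counts headers and flags sequence lines with characters you would not expect in protein sequence.

```shell
#!/usr/bin/env bash
# Toy amino acid fasta (made-up sequences) standing in for your real file.
cat > toy_proteins.fasta <<'EOF'
>transcript_1
MKTAYIAKQR
>transcript_2
MSLLTEVETY
ACDEFGHIKL
>transcript_3
MNNQRKKTAR
EOF

# Number of sequences = number of header lines.
n_seqs=$(grep -c '^>' toy_proteins.fasta)

# Sequence lines containing anything other than letters,
# '*' (stop) or '-' (gap); these are worth a second look.
n_bad=$(grep -v '^>' toy_proteins.fasta | grep -c '[^A-Za-z*-]')

echo "sequences: $n_seqs"
echo "suspicious lines: $n_bad"
```

To run this on real data, drop the toy file creation and point both grep commands at your own fasta.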
Step 2: Wait
An even easier step! The job can take ~24 hours, but sometimes as little as 4 hours. The wait depends on how many transcripts you have and how busy the server is.
Step 3: Retrieve your results
You will get an email telling you that your job is complete, with a link to the results. The results page will look like this:

There are a couple of things that you can dig into here. The first is a general idea of some of the functional categories. How useful this is depends largely on which amino acid sequences you submit. In this case, since we submitted the full set, we would expect a diversity of categories. If you submitted just a subset, such as the differentially expressed transcripts, you may see a smaller group of functional categories.
In this example, the test data is limited, so the results are also limited.
Another thing to look at is a preview of the first 100 entries, as it is always good to check your data when you get it!

You’ll notice that there are a number of missing annotations. This is perfectly normal! Also, the second-best hit is listed along with its score. This can be helpful if you are really interested in a particular transcript.
If you want this level of detail, you need to download the data at the top of this page, where it says “download detail”. If you download it directly from the previous page, it will only provide the gene name and the KO.
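If you want a number for how many transcripts came back without an annotation, a short awk summary of the downloaded table does the job. This is a minimal sketch under some assumptions: the table is tab-separated with the transcript name in column 1 and the KO in column 2 (empty when there was no hit), and the toy file and its IDs below are invented purely for illustration.

```shell
#!/usr/bin/env bash
# Toy stand-in for the downloaded detail table (made-up IDs and definitions).
# Assumed layout: transcript <TAB> KO <TAB> definition <TAB> score
{
  printf 'tx1\tK00001\texample enzyme A\t250\n'
  printf 'tx2\n'
  printf 'tx3\tK00002\texample enzyme B\t180\n'
  printf 'tx4\n'
  printf 'tx5\tK00001\texample enzyme A\t90\n'
} > toy_user_ko_definition.txt

# Count lines whose second column looks like a KO number (K + digits)
# and report that against the total number of lines.
awk -F'\t' '
    $2 ~ /^K[0-9]+$/ { annotated++ }
    END {
        printf "%d of %d transcripts annotated (%.1f%%)\n",
               annotated, NR, 100 * annotated / NR
    }
' toy_user_ko_definition.txt > toy_annotation_summary.txt

cat toy_annotation_summary.txt
```

On the toy table this reports 3 of 5 transcripts annotated; on real data, swap in your own user_ko_definition.txt.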
Going back to the main results page, you can also click on the “Reconstruct Pathway” link – this will bring you to a new set of pages that allow for some quick visualization!
Step 4: Immediate Visualization

The “Reconstruct Pathway” link will lead you to a page with a list of pathways, each followed by a number. These are the standard pathways in KEGG and the number of transcripts from your input set that fall in each pathway. Let’s click on Carbon Metabolism to get a nice visual.

This automatically highlights the parts of the pathway that are present in the input set. If you are using the full assembly, this can be a good way to look at the coverage. This viewer highlights the arrows, rather than the boxed names of the enzymes in the pathway (as previous versions did). This is still useful, but if you want to do fancier things, you should continue below!
Additional instructions (and citation) available here: https://www.kegg.jp/blastkoala/help_ghostkoala.html
Step 5: Visualization later
The data you have from the results page will only be available for a couple of days. You will need to download the data (I recommend the detailed data!). However, there isn’t a convenient way to download all the pathways you may want to visualize later. There is an easy way to pull them back up in the future though – provided you have the downloaded file with the KO numbers:
- Go to the KEGG Mapper tool – link: https://www.genome.jp/kegg/tool/map_pathway.html
- Add your KO numbers into the entry box. You can pull these from the downloaded table with the following bash command:
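awk '$2 ~ /^K[0-9]+$/ {print $1, $2}' user_ko_definition.txt > ko.name_ko_only
This command creates a list of transcripts and their KO values, dropping the lines for transcripts that were not annotated. Matching on the KO column itself (rather than any line containing a "K") avoids keeping unannotated transcripts whose names happen to contain a K.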
- KEGG will then give you a similar listing to the one you had in your initial results file, with the pathways listed and the number of hits per pathway.

- If you click a pathway, you can get:

This gives you a colored box for each transcript found in your input set. You can change the color easily in Photoshop, PowerPoint, etc.
Step 6: Advanced Visualization
One thing I like to do if I am using differential expression data is to make a figure that demonstrates both how much of the pathway was identified, as well as how many of those identified transcripts are differentially expressed. This allows clear visualization of what might be missing from the differential expression set. Here’s a quick hack to make this kind of figure:
- Given a list of genes of interest (e.g. DE.list) and a full list of KEGG annotated genes (e.g. user_ko_definition.txt), pull the DE.list entries out of the KEGG annotation:
grep -f DE.list user_ko_definition.txt | awk '{print $2, "green"}' > DE.KEGG.list
- Then add in all the non-matching annotations, but give them a different color:
grep -vf DE.list user_ko_definition.txt | awk '{print $2, "yellow"}' >> DE.KEGG.list
- Then use this version of the KEGG pathway tool: https://www.kegg.jp/kegg/tool/map_pathway2.html

Note that in this example, KOs in green come from the DE.list subset, and KOs in yellow are in the full annotation but not in DE.list. This lets you differentiate between transcripts that may have been missed in your assembly (left uncolored) and those that were found but show no evidence of differential expression (yellow)!
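One wrinkle with the two-grep approach is that the same KO can end up in the list twice, once green and once yellow, when one transcript for that KO is differentially expressed and another is not. Here is a hedged alternative sketch that builds the color list in a single awk pass. It swaps grep's pattern matching for exact matching on the transcript ID in column 1, skips transcripts without a KO, and lets green win whenever a KO appears in both groups. The toy file names and IDs are made up for the example, and the tab-separated layout is an assumption.

```shell
#!/usr/bin/env bash
# Toy inputs (made-up IDs): an annotation table and a list of DE transcripts.
printf 'tx1\tK00001\ntx2\ntx3\tK00002\ntx4\tK00003\ntx5\tK00001\n' \
    > toy_user_ko_definition.txt
printf 'tx1\ntx4\n' > toy_DE.list

awk -F'\t' '
    NR == FNR           { de[$1] = 1; next }            # first file: remember DE IDs
    $2 !~ /^K[0-9]+$/   { next }                        # skip transcripts with no KO
    ($1 in de)          { color[$2] = "green"; next }   # DE transcript: KO is green
    !($2 in color)      { color[$2] = "yellow" }        # otherwise yellow, unless already set
    END                 { for (ko in color) print ko, color[ko] }
' toy_DE.list toy_user_ko_definition.txt | sort > toy_DE.KEGG.list

cat toy_DE.KEGG.list
```

In the toy data, K00001 belongs to both a DE transcript (tx1) and a non-DE transcript (tx5), and it comes out green exactly once. Because the matching is on exact IDs, the entries in your DE list need to be written the same way as the names in column 1 of the annotation table.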
Step 7: Alternative inputs
You can also do this whole analysis with KO values from the annotation table that comes from Trinotate! To generate the file of KEGG terms, you can do the following:
sed 's/\ /_/g' Report.xls | awk '{print $1,$12}' | grep "KO" | \
sed 's/[A-Za-z0-9:`]*`KO://g' > KEGG_from_annotation.name_and_ko
Then you can repeat the above steps in KEGG Mapper (starting from Step 5), without needing to use GhostKOALA!
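To see what that pipeline is doing, it can help to run it on a tiny fake report first. The sketch below assumes a tab-separated report with the KEGG field in column 12, formatted as a KEGG gene entry and a KO entry joined by a backtick; the three rows and their IDs are invented for illustration, so check the column number against your own Trinotate report.

```shell
#!/usr/bin/env bash
# Toy report (made-up rows): column 1 = gene ID, column 12 = KEGG field.
# "." marks an empty field; row 3 has spaces inside a field on purpose.
{
  printf 'gene1\ttx1\t.\t.\t.\t.\t.\t.\t.\t.\t.\tKEGG:hsa:1234`KO:K00001\t.\n'
  printf 'gene2\ttx2\t.\t.\t.\t.\t.\t.\t.\t.\t.\t.\t.\n'
  printf 'gene3\ttx3\tsome hit name\t.\t.\t.\t.\t.\t.\t.\t.\tKEGG:mmu:5678`KO:K00002\t.\n'
} > toy_Report.xls

# 1. Replace spaces with underscores so awk's whitespace splitting
#    keeps each tab-separated field in one piece.
# 2. Keep the gene ID and the KEGG column.
# 3. Keep only rows that carry a KO entry.
# 4. Strip everything in the KEGG field up to and including "KO:".
sed 's/ /_/g' toy_Report.xls | awk '{print $1, $12}' | grep "KO" | \
    sed 's/[A-Za-z0-9:`]*`KO://g' > toy_KEGG_from_annotation.name_and_ko

cat toy_KEGG_from_annotation.name_and_ko
```

The toy output keeps gene1 and gene3 with their KO numbers and drops gene2, which has no KEGG entry.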
Bonus: Metagenomic Analysis
You can also use GhostKOALA with this annotation system for microbiome and metagenomic data. The Meren Lab has a great post about how to do this. Definitely check it out (link here) if you want to use Anvi’o!