The Sequence Read Archive (SRA) is a publicly available repository of sequence data and just one of the many databases hosted by the NIH’s National Center for Biotechnology Information (NCBI). A neat feature of the SRA is that it includes data submitted to NCBI, the European Bioinformatics Institute (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ). The archive consists of raw sequences from various NGS platforms and, more recently, alignment data. Submitting sequences to the database allows researchers to contribute to the reproducibility efforts of the broader community. SRA reads are typically pulled for increasing sample sizes and validating experimental results, but the creative and resourceful investigator can mine the database to explore new research questions.
The SRA has grown steadily since 2007; today, February 22, 2021, it contains 21 million gigabases of open-access reads! Thankfully, NCBI provides us with streamlined search capabilities and the SRA-Toolkit for downloading our search results and converting them to standard file formats. NCBI also offers a useful guide that outlines the steps for searching, downloading, and converting SRA data. Nevertheless, the NCGAS team finds that many users unfamiliar with SRA-Toolkit or the command line struggle with configuring the program or parsing its multiple commands and options. Here, I aim to condense and distill this information so that a beginner can use the SRA to their heart’s content!
Step 1: Search the SRA and generate an accession list
Let’s say I want to search the SRA for all poison frog sequences (family Dendrobatidae) that are referenced by PubMed and PubMed Central. I can do this with these advanced search options:

Next, I’ll download an accession list for all 156 SRA hits using NCBI’s ‘Send to:’ dropdown menu.

This will download a text file named SraAccList.txt that lists the Run accessions from your search. As you’ll see, SRA-Toolkit will accept this text file as input, so you don’t have to enter hundreds of accessions manually.
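Before handing the list to SRA-Toolkit, it can be worth a quick sanity check. The sketch below is my own addition, not part of NCBI's tooling: check_acc_list is a hypothetical helper, and it assumes the file holds one run accession per line in the usual SRR/ERR/DRR form.

```shell
# Hypothetical helper (not part of SRA-Toolkit): sanity-check an accession list.
# Assumes one run accession per line, e.g. SRR3900953.
check_acc_list() {
    list="$1"
    # Strip carriage returns (in case the file passed through a Windows editor)
    # and drop blank lines.
    cleaned=$(tr -d '\r' < "$list" | grep . || true)
    # Report how many accessions remain.
    echo "accessions: $(printf '%s\n' "$cleaned" | grep -c . || true)"
    # Print every line that does not look like a run accession.
    printf '%s\n' "$cleaned" | grep . | grep -Ev '^[SED]RR[0-9]+$' || true
}
```

Running check_acc_list SraAccList.txt prints the accession count, followed by any lines that don't look like run accessions. If nothing follows the count, the list is clean.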
Step 2: Configure SRA-Toolkit
As we noted in a previous blog post, the default configuration for SRA-Toolkit will store temporary files generated during sequence download in your home directory. Since your home space is just 100GB, we need to change it to a more suitable location.
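Before opening the configuration tool, it can help to create the target directory and confirm that it has room for your downloads. This is a minimal sketch using only standard shell utilities; prepare_sra_dir is a hypothetical helper name, and the path you pass to it is whatever location you plan to select.

```shell
# Hypothetical helper: create the directory you plan to select in vdb-config
# and report how much space is free on its filesystem.
prepare_sra_dir() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # df -Pk prints one POSIX-format line per filesystem in 1K blocks;
    # column 4 of the second line is the available space.
    df -Pk "$dir" | awk 'NR==2 {printf "free: %.1f GB\n", $4/1024/1024}'
}
```

For example, prepare_sra_dir /N/slate/layfreeb creates the directory if needed and prints a single line reporting the free space in GB.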
Follow these instructions in your shell:
module load sra-toolkit
vdb-config -i
Use tab (on a PC) or option + tab (on a Mac) to move the red cursor to [Change]. Press enter or space.

Navigate to [Goto] to type in or copy/paste a path into the field or [Create Dir] to create a new subdirectory. Enter [OK].
By default, SRA-Toolkit will create an sra subdirectory in whichever directory you specify, so keep this in mind when choosing or naming your directory.
![Image showing the [Change] window where users can navigate to a directory.](https://blogs.iu.edu/ncgas/files/2021/02/beginner_sra_2-781x1024.png)
Agree to your proposed directory change, save [6] and then exit [7].
That’s it! Your SRA-Toolkit is properly configured.
Step 3: Pull sequences from the SRA
Use SRA-Toolkit’s prefetch command to pull runs from the SRA. prefetch will download sequences in the .sra format, along with temporary files that it needs to convert .sra to more useful file formats. Note that prefetch will download your files to the directory you specified in Step 2. In my case, this is /N/slate/layfreeb/, so the files land in its sra subdirectory.
For single accessions, simply do:
prefetch SRR3900953
ls /N/slate/layfreeb/sra
SRR3900953.sra
For lists of run accessions, like the one I downloaded in Step 1, it’s easiest to copy and paste from your computer’s text editor into a new file in your working directory (e.g., a sub-directory in Slate or Slate-Project).
If you have lots of accessions in SraAccList.txt, I recommend doing prefetch on an interactive node.
qsub -I -q interactive -l nodes=1:ppn=1,walltime=2:00:00
module load sra-toolkit
nano SraAccList.txt #save your changes with ctrl+O
prefetch --option-file <path to SraAccList.txt>
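With a long list, a download or two can fail quietly, so it's handy to compare the list against what actually arrived. The following is a sketch under two assumptions: .sra files sit directly in the sra directory, as in my ls output above, and missing_runs is a hypothetical helper name of my own.

```shell
# Hypothetical helper: print accessions from the list that have no .sra file yet.
# Assumes files are stored as <sra_dir>/<accession>.sra.
missing_runs() {
    list="$1"
    sra_dir="$2"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Tolerate Windows line endings and skip blank lines.
        acc=$(printf '%s' "$acc" | tr -d '\r')
        [ -z "$acc" ] && continue
        [ -f "$sra_dir/$acc.sra" ] || echo "$acc"
    done < "$list"
}
```

For example, missing_runs SraAccList.txt /N/slate/layfreeb/sra > missing.txt writes the outstanding accessions to a new file, which can then be passed to prefetch the same way as the original list.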
Step 4: Convert sequences from .sra to common file formats
Use SRA-Toolkit’s fasterq-dump or sam-dump to convert from .sra to .fastq or .sam, respectively.
For single files:
fasterq-dump /N/slate/layfreeb/sra/SRR3900953.sra
ls
SRR3900953.sra_1.fastq SRR3900953.sra_2.fastq
As you can see above, fasterq-dump splits .sra files into forward and reverse sequences for paired reads, denoted by _1 and _2. Also, note that the command outputs to your current working directory unless otherwise specified.
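To send the output somewhere other than your working directory, fasterq-dump accepts --outdir, and --threads sets the number of threads it uses. Here is a small sketch of a wrapper around those two flags. dump_to, DUMP, and THREADS are hypothetical names of my own, and the default of 4 threads is an arbitrary choice, so match it to the cores you requested.

```shell
# Hypothetical wrapper: convert one .sra file into a chosen output directory.
# DUMP can be overridden to preview the command, e.g. DUMP="echo fasterq-dump".
# THREADS defaults to 4 here, which is an arbitrary choice for illustration.
dump_to() {
    sra_file="$1"
    out_dir="$2"
    mkdir -p "$out_dir" || return 1
    ${DUMP:-fasterq-dump} --outdir "$out_dir" --threads "${THREADS:-4}" "$sra_file"
}
```

For example, dump_to /N/slate/layfreeb/sra/SRR3900953.sra fastq places the FASTQ files in a fastq subdirectory of your working directory.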
fasterq-dump will automatically delete the temporary files it created, but if you don’t need the .sra files then it’s a good idea to delete them:
rm /N/slate/layfreeb/sra/*.sra
Hot tip! You can skip the prefetch step for single runs by leaving off the .sra extension:
fasterq-dump SRR3900953
ls
SRR3900953.sra_1.fastq SRR3900953.sra_2.fastq
rm /N/slate/layfreeb/sra/SRR3900953.sra
Skipping prefetch creates an SRRXXXXX.sra.cache file that isn’t automatically deleted, so don’t forget to remove it:
rm /N/slate/layfreeb/sra/SRR3900953.sra.cache
sam-dump will default to STDOUT, which isn’t terribly useful, so use the --output-file flag to specify a file name and/or path.
sam-dump /N/slate/layfreeb/sra/SRR3900953.sra --output-file SRR3900953.sam
rm /N/slate/layfreeb/sra/SRR3900953.sra
For multiple .sra conversions, a simple loop does the trick:
for f in *.sra; do fasterq-dump "$f"; done
for f in *.sra; do sam-dump "$f" --output-file "${f%.sra}.sam"; done
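If a long conversion gets interrupted, rerunning the simple loop repeats work that is already done. The sketch below is one way to skip runs that already have FASTQ output in the current directory. has_fastq, convert_new, and DUMP are hypothetical names of my own, and the file-name patterns are assumptions based on the output shown earlier.

```shell
# Hypothetical helper: succeed if FASTQ output for a run already exists in
# the current directory. Patterns cover names like SRR1.fastq,
# SRR1_1.fastq, and SRR1.sra_1.fastq (assumed naming).
has_fastq() {
    for g in "$1".fastq "$1"_*.fastq "$1".sra*.fastq; do
        [ -e "$g" ] && return 0
    done
    return 1
}

# Hypothetical helper: convert every .sra file in a directory, skipping runs
# that were already converted. DUMP can be overridden for a dry run.
convert_new() {
    sra_dir="$1"
    for f in "$sra_dir"/*.sra; do
        [ -e "$f" ] || continue            # the glob matched nothing
        base=$(basename "$f" .sra)
        if has_fastq "$base"; then
            echo "skip $base"
            continue
        fi
        ${DUMP:-fasterq-dump} "$f" || echo "failed $base" >&2
    done
}
```

Call it from the directory where you want the FASTQ files, for example convert_new /N/slate/layfreeb/sra. Setting DUMP="echo fasterq-dump" first prints the commands instead of running them.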
Done and done! You should be well on your way to using the SRA. For more advanced options, don’t forget to use -h:
sam-dump -h
fasterq-dump -h
prefetch -h