Retrieve paths to assets

Once you've assembled a few assets, either by downloading or by building them, you'll be able to use refgenie seek to retrieve the paths. It's quite simple, really -- say you've built the bowtie2_index and fasta assets for hg38. If you type:

refgenie seek -c CONFIG.yaml hg38/bowtie2_index

You'll get back the absolute path on your system to the bowtie2_index asset, something like:

/path/to/genomes_folder/hg38/bowtie2_index

Seek keys

Refgenie seek uses a concept called seek keys. Let's formally define how these differ from an asset:

  • asset: a folder consisting of one or more files related to a specific reference genome
  • seek key: a string identifier for a particular file or folder contained within an asset that can be located with refgenie seek.

As asset can have multiple seek keys. When an asset recipe is defined, the author records which files in the asset are potentially interesting for a user to retrieve a path to. These files are added to the recipe as seek keys.

Many assets have only a single seek key, because the asset really only consists of a single thing. For example, a bowtie2_index asset has only 1 thing you'd need to seek to: the folder containing the indexes. But other assets can have multiple seek keys. For example, the fasta asset provides not only a fasta file, but a .fai fasta index as well as a chrom_sizes file that is required by some tools (notably, ucsc tools). These files are really tightly coupled with a particular fasta file and it doesn't make sense to make separate assets for each of them, and this is where seek keys come in.

Once you have built a fasta asset, you'll be able to retrieve not just the fasta file, but also the fai, and chrom_sizes files. These will show up when you type refgenie list as fasta.fai and fasta.chrom_sizes seek keys.

Seek keys make things portable

Why is this useful? Instead of writing your shell script to take a path to each the chrom_sizes file and the bowtie2_index path, you just take the genome name, and use refgenie seek to find the correct path. Not only have you simplified the interface to your analysis, but both the script itself and your call of it are now portable.

In other words, you just write:

python script.py --genome hg38

and then use refgenie to seek any assets you need. Previously, you would have done:

python script.py --chrom_sizes /local/path/to/hg38/chrom_size \
  --bt2_index /local/path/to/hg38/bowtie2_index

This second command obviously differs by computing environment, whereas the first does not.

Examples

GENOME="hg38"

refgenie seek -g $GENOME -a bowtie2_index
refgenie seek -g $GENOME -a fasta.chrom_sizes
refgenie seek -g $GENOME -a fasta

or:

refgenie seek hg38/bowtie2_index
refgenie seek hg38/fasta.chrom_sizes
refgenie seek hg38/fasta