Building assets for custom genomes

Once you've installed refgenie, you can use refgenie pull to download pre-built assets without installing any additional software. However, you may need to use the build function for genomes or assets that are not available on the server. You can build assets for any genome for which you can provide the required inputs.

Building assets is a bit more complicated than pulling them. If you want to build assets, you'll need to get the software required by the asset you want to build. You have three choices to get that software: you can either install it natively, or use a docker image (details further down this page), or you can use our new bulker manifest. This will start a pipeline that will create the requested asset and populate the genome config file for you. You can see the example build output.

Once you're set up, you simply run refgenie build, passing it any necessary input arguments called for by the asset recipe. Each asset requires some input. For many of the built-in recipes, this is just a FASTA file. To learn what are the required inputs or other asset depedancies, add an -r flag to the refgenie build command:

$ refgenie build hg38/bowtie2_index -r

'hg38/bowtie2_index' build requirements: 
- assets: fasta

In this case you'll need to build the fasta asset for hg38 genome before building bowtie2_index. Notice how 'fasta' appears under assets and not under arguments. What this means is that to build a bowtie2 index, you do not provide a fasta file as an argument, as you might expect. Instead, you must already have a fasta asset managed by refgenie. One of the advantages of this is that it allows refgenie to keep a record of how you've built your assets, so refgenie can remember the link between this bowtie2 asset and the fasta asset, which turns out to be very useful for maintaining provenance of your assets. It also makes it easier to build these kind of derived assets, because you don't actually have to pass any additional arguments to build them.

What assets can refgenie build?

At the moment the building functionality is under rapid development and may change in the future. While refgenie is totally flexible with respect to genome, it is more restricted in terms of what assets it can build. We are planning to allow users to specify their own recipes for arbitrary assets, but at the moment, refgenie can only build a handful of assets for which we have already created building recipes. Refgenie comes with built-in recipes to build indexes for common tools like bowtie2, hisat2, bismark, salmon, bwa, and a few others. If you type refgenie list, you'll get a list of all the assets you can build with refgenie (these show up under recipes). If you want to add a new asset, you'll have to work with us to provide a script that can build it, and we can incorporate it into refgenie. We expect this will get much easier in the future.

Below, we go through the assets you can build and how to build them.

Examples for top-level assets you can build

fasta

We recommend for every genome, you first build the fasta asset, because it's a starting point for building a lot of other assets. You just have to give a compressed fasta file.

Some examples are:

  • hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
  • hg38: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
  • mm10: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.5_GRCm38.p3/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001635.5_GRCm38.p3_no_alt_analysis_set.fna.gz
  • This README describes the sequences.
export REFGENIE="test.yaml"
refgenie build test/fasta --fasta rCRS.fa.gz
refgenie seek test/fasta

refgene_anno

A refgene annotation file is used to build several other derived assets.

Some examples:

  • hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
  • hg38: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
  • mm10: http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz
  • rn6: http://hgdownload.cse.ucsc.edu/goldenPath/rn6/database/refGene.txt.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
refgenie build hg38/refgene_anno --refgene refGene.txt.gz

gencode_gtf

The gencode_gtf asset just copies over a GTF annotation file provided by gencode.

Some examples are:

  • hg19: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
  • hg38: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
  • mm10: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/gencode.vM22.annotation.gtf.gz

Build the asset like:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
refgenie build hg19/gencode_gtf --gencode_gtf gencode.v19.annotation.gtf.gz

ensembl_gtf

wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl-gtf --ensembl_gtf Homo_sapiens.GRCh38.97.gtf.gz

Some examples are:

  • hg38: ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
  • hg19: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
  • mm10: ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz
  • rn6: ftp://ftp.ensembl.org/pub/release-97/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.97.gtf.gz

ensembl_rb

This is the ensembl regulatory build. It requires an input gff file.

Some examples are:

  • hg38: ftp://ftp.ensembl.org/pub/release-96/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190122.gff.gz
  • mm10: ftp://ftp.ensembl.org/pub/release-97/regulation/mus_musculus/mus_musculus.GRCm38.Regulatory_Build.regulatory_features.20180516.gff.gz
wget ftp://ftp.ensembl.org/pub/release-96/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190122.gff.gz
refgenie build hg38/ensembl_rb --gff homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190122.gff.gz

Examples for derived assets you can build

bowtie2 index

The bowtie2_index asset doesn't require any input, but does require that you've already built the fasta asset. So, first build the fasta asset for your genome of interest, and then you just build the bowtie2_index asset with no other requirements:

refgenie build test/bowtie2_index -d 

bismark indexes

The bismark_index assets doesn't require any input, but does require that you've already built the fasta asset.

refgenie build test/bismark_bt2_index -d -R

ensembl_gtf

The `ensembl_gtf asset is a copy of the ENSEMBL annotation file. You could build it like this:

wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --ensembl-gtf Homo_sapiens.GRCh38.97.gtf.gz

Install building software natively

Refgenie expects to find in your PATH any tools needed for building a desired asset. You'll need to follow the instructions for each of these individually. You could find some basic ideas for how to install these programatically in the dockerfile. At the moment, the build system is not very flexible, and we don't have great documentation for what is required if you want to use this native approach. In our next major update, we're planning to revamp this system to provide a much more robust build system.

Building assets with docker

If you don't want to install all the software needed to build all these assets (and I don't blame you), then you can just use docker. Each of our recipes knows about a docker image that has everything it needs. If you have docker installed, you should be able to simply run refgenie build with the -d flag. For example:

refgenie build -d genome/asset ...

This tells refgenie to execute the building in a docker container requested by the particular asset recipe you specify. Docker will automatically pull the image it needs when you call this. If you like, you can build the docker container yourself like this:

git clone https://github.com/databio/refgenie.git
cd refgenie/containers
make refgenie

or pull it directly from dockerhub like this:

docker pull databio/refgenie

Versioning the assets

refgenie supports tags to facilitate management of multiple "versions" of the same asset. Simply add a :your_tag_name appendix to the asset registry path in the refgenie build command and the created asset will be tagged:

refgenie build hg38/bowtie2_index:my_tag

You can also learn more about tagging refgenie assets.