Building assets for custom genomes
Once you've installed refgenie, you can use
refgenie pull to download pre-built assets without installing any additional software. However, you may need to use the
build function for genomes or assets that are not available on the server. You can build assets for any genome for which you can provide the required inputs.
Building assets is a bit more complicated than pulling them. If you want to build assets, you'll need to get the software required by the asset you want to build. You have two choices to get that software: you can either install it natively, or use a docker image (details further down this page). Once you're set up, you simply run
refgenie build, passing it any necessary input arguments called for by the asset recipe.
What assets can refgenie build?
At the moment the building functionality is under rapid development and may change in the future. While
refgenie is totally flexible with respect to genome, it is more restricted in terms of what assets it can build. We are planning to allow users to specify their own recipes for arbitrary assets, but at the moment,
refgenie can only build a handful of assets for which we have already created building recipes. Refgenie comes with built-in recipes to build indexes for common tools like bowtie2, hisat2, bismark, salmon, bwa, and a few others. If you type
refgenie list, you'll get a list of all the assets you can build with refgenie. If you want to add a new asset, you'll have to work with us to provide a script that can build it, and we can incorporate it into refgenie. We expect this will get much easier in the future.
Each asset requires some input. For many of the built-in recipes, this is just a FASTA file. Below, we go through the assets you can build and how to build them.
Examples for top-level assets you can build
We recommend for every genome, you first build the
fasta asset, because it's a starting point for building a lot of other assets. You just have to give a compressed fasta file.
Some examples are:
- hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
- hg38: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- mm10: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.5_GRCm38.p3/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001635.5_GRCm38.p3_no_alt_analysis_set.fna.gz
- This README describes the sequences.
export REFGENIE="test.yaml" refgenie build -g test -a fasta --fasta rCRS.fa.gz refgenie seek -g test -a fasta
A refgene annotation file is used to build several other derived assets.
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz refgenie build -g hg38 -a refgene_anno --refgene refGene.txt.gz
The gencode_gtf asset just copies over a GTF annotation file provided by gencode.
- hg19: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
- hg38: GTF=ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.primary_assembly.annotation.gtf.gz
- mm10: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M12/gencode.vM12.primary_assembly.annotation.gtf.gz
Build the asset like:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.primary_assembly.annotation.gtf.gz refgenie build -g hg19 -a gencode_gtf --gtf gencode.v29.primary_assembly.annotation.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz refgenie build -g hg19 -a ensembl_gtf --gtf Homo_sapiens.GRCh38.97.gtf.gz
This is the ensembl regulatory build. It requires an input
wget ftp://ftp.ensembl.org/pub/release-97/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz refgenie build -g hg19 -a ensembl_rb --gff homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
Examples for derived assets you can build
The bowtie2_index asset doesn't require any input, but does require that you've already built the
fasta asset. So, first build the fasta asset for your genome of interest, and then you just build the
bowtie2_index asset with no other requirements:
refgenie build -g test -a bowtie2_index -d
The bismark index assets doesn't require any input, but does require that you've already built the
refgenie build -g test -a bismark_bt2_index -d -R
The ensembl_gtf asset is a copy of the ENSEMBL annotation file. You could build it like this:
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz refgenie build -g hg38 -a ensembl_gtf --gtf Homo_sapiens.GRCh38.97.gtf.gz
Install building software natively
Refgenie expects to find in your
PATH any tools needed for building a desired asset. You'll need to follow the instructions for each of these individually. You could find some basic ideas for how to install these programatically in the dockerfile. At the moment, the build system is not very flexible, and we don't have great documentation for what is required if you want to use this native approach. In our next major update, we're planning to revamp this system to provide a much more robust build system.
Building assets with docker
If you don't want to install all the software needed to build all these assets (and I don't blame you), then you can just use docker. Each of our recipes knows about a docker image that has everything it needs. If you have docker installed, you should be able to simply run
refgenie build with the
-d flag. For example:
refgenie build -d --genome ... --asset ...
This tells refgenie to execute the building in a docker container requested by the particular asset recipe you specify. Docker will automatically pull the image it needs when you call this. If you like, you can build the docker container yourself like this:
git clone https://github.com/databio/refgenie.git cd refgenie/containers make refgenie
or pull it directly from dockerhub like this:
docker pull nsheff/refgenie