Building assets for custom genomes
Once you've installed refgenie, you can use
refgenie pull to download pre-built assets without installing any additional software. However, you may need to use the
build function for genomes or assets that are not available on the server. You can build assets for any genome for which you can provide the required inputs.
Building assets is a bit more complicated than pulling them. To build an asset, you need the software required by that asset's recipe. There are three ways to get it: install it natively, use a docker image (details for both are further down this page), or use our new bulker manifest. Running a build starts a pipeline that creates the requested asset and populates the genome config file for you. You can see the example build output.
Once you're set up, you simply run
refgenie build, passing it any necessary input arguments called for by the asset recipe. Each asset requires some input. For many of the built-in recipes, this is just a FASTA file. To learn which inputs or other asset dependencies are required, add an
-r flag to the
refgenie build command:
$ refgenie build hg38/bowtie2_index -r
'hg38/bowtie2_index' build requirements:
- assets: fasta
In this case you'll need to build the
fasta asset for the
hg38 genome before building
bowtie2_index. Notice how 'fasta' appears under
assets and not under
arguments. This means that to build a bowtie2 index, you do not provide a fasta file as an argument, as you might expect. Instead, you must already have a fasta asset managed by refgenie. One advantage is that refgenie keeps a record of how you built your assets: it remembers the link between this bowtie2 asset and the fasta asset, which is very useful for maintaining provenance. It also makes these kinds of derived assets easier to build, because you don't have to pass any additional arguments.
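The requirements listing can also be read by a script. The sketch below assumes the output keeps the layout of the example above, with a "- assets:" entry followed by the asset name. The list_required_assets helper is hypothetical and not part of refgenie; it works on sample text here rather than calling refgenie.

```shell
# Hypothetical helper (not part of refgenie): read the text printed by
# "refgenie build <genome>/<asset> -r" on stdin and print whatever follows
# "- assets:". Assumes the layout shown in the example above.
list_required_assets() {
    sed -n 's/.*- assets:[[:space:]]*//p'
}

# Sample text copied from the example above; in practice you would pipe in
# the real command output instead.
sample="'hg38/bowtie2_index' build requirements:
- assets: fasta"

printf '%s\n' "$sample" | list_required_assets
```

Each name printed this way is an asset that must already be managed by refgenie before the requested build can run.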
What assets can refgenie build?
At the moment the building functionality is under rapid development and may change in the future. While
refgenie is totally flexible with respect to genome, it is more restricted in terms of what assets it can build. We are planning to allow users to specify their own recipes for arbitrary assets, but at the moment,
refgenie can only build a handful of assets for which we have already created building recipes. Refgenie comes with built-in recipes to build indexes for common tools like bowtie2, hisat2, bismark, salmon, bwa, and a few others. If you type
refgenie list, you'll get a list of all the assets you can build with refgenie (these show up under recipes). If you want to add a new asset, you'll have to work with us to provide a script that can build it, and we can incorporate it into
refgenie. We expect this will get much easier in the future.
Below, we go through the assets you can build and how to build them.
Examples for top-level assets you can build
We recommend that, for every genome, you first build the
fasta asset, because it is the starting point for building many other assets. The only input is a compressed fasta file.
Some examples are:
- hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
- hg38: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- mm10: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.5_GRCm38.p3/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001635.5_GRCm38.p3_no_alt_analysis_set.fna.gz
- This README describes the sequences.
export REFGENIE="test.yaml"
refgenie build test/fasta --fasta rCRS.fa.gz
refgenie seek test/fasta
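Because the fasta recipe expects a compressed file, it can help to confirm the input before starting a build. This is a minimal sketch that assumes gzip compression, as in the .gz examples above; is_gzipped is a hypothetical helper, not a refgenie command.

```shell
# Return success if the file passes gzip's integrity test.
# is_gzipped is a hypothetical helper, not part of refgenie.
is_gzipped() {
    gzip -t "$1" 2>/dev/null
}

# rCRS.fa.gz is the file name used in the example above.
if is_gzipped rCRS.fa.gz; then
    echo "rCRS.fa.gz looks like a valid gzip file"
else
    echo "rCRS.fa.gz is missing or not gzip-compressed"
fi
```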
A refgene annotation file is used to build several other derived assets.
- hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
- hg38: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
- mm10: http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz
- rn6: http://hgdownload.cse.ucsc.edu/goldenPath/rn6/database/refGene.txt.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
refgenie build hg38/refgene_anno --refgene refGene.txt.gz
The gencode_gtf asset just copies over a GTF annotation file provided by gencode.
Some examples are:
- hg19: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
- hg38: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
- mm10: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/gencode.vM22.annotation.gtf.gz
Build the asset like:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
refgenie build hg19/gencode_gtf --gencode_gtf gencode.v19.annotation.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --ensembl_gtf Homo_sapiens.GRCh38.97.gtf.gz
Some examples are:
- hg38: ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
- hg19: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
- mm10: ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz
- rn6: ftp://ftp.ensembl.org/pub/release-97/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.97.gtf.gz
This is the ensembl regulatory build. It requires a GFF file as input.
Some examples are:
- hg38: ftp://ftp.ensembl.org/pub/release-96/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190122.gff.gz
- mm10: ftp://ftp.ensembl.org/pub/release-97/regulation/mus_musculus/mus_musculus.GRCm38.Regulatory_Build.regulatory_features.20180516.gff.gz
wget ftp://ftp.ensembl.org/pub/release-96/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190122.gff.gz
refgenie build hg38/ensembl_rb --gff homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190122.gff.gz
Examples for derived assets you can build
The bowtie2_index asset doesn't require any input, but does require that you've already built the
fasta asset. So, first build the
fasta asset for your genome of interest, and then you just build the
bowtie2_index asset with no other requirements:
refgenie build test/bowtie2_index -d
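The ordering described above (fasta first, then the derived index) can be sketched as a small dry-run helper that prints the commands instead of running them. The plan_index_build function is hypothetical; the genome name and file name are taken from the examples on this page.

```shell
# Print (rather than run) the build commands in dependency order:
# the fasta asset first, then the index derived from it.
# plan_index_build is a hypothetical helper, not part of refgenie.
plan_index_build() {
    genome=$1
    fasta_file=$2
    index_asset=${3:-bowtie2_index}
    printf 'refgenie build %s/fasta --fasta %s\n' "$genome" "$fasta_file"
    printf 'refgenie build %s/%s\n' "$genome" "$index_asset"
}

plan_index_build test rCRS.fa.gz
```

Printing the plan first makes it easy to review the order before anything is built; piping the output to sh would then execute it.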
The bismark_bt2_index asset doesn't require any input, but does require that you've already built the fasta asset:
refgenie build test/bismark_bt2_index -d -R
The ensembl_gtf asset is a copy of the ENSEMBL annotation file. You could build it like this:
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --ensembl_gtf Homo_sapiens.GRCh38.97.gtf.gz
Install building software natively
Refgenie expects to find in your
PATH any tools needed for building a desired asset. You'll need to install each tool individually, following its own instructions. You can find some basic ideas for how to install them programmatically in the dockerfile. At the moment, the build system is not very flexible, and we don't have great documentation for what is required if you want to use this native approach. In our next major update, we're planning to revamp this system to provide a much more robust build system.
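Before a native build, you may want to confirm that the tools are actually visible on your PATH. The sketch below uses the standard command -v lookup; report_missing_tools is a hypothetical helper, and the tool name passed to it is only an assumption about what a bowtie2 recipe would need.

```shell
# Print the name of every tool given as an argument that is not on PATH.
# report_missing_tools is a hypothetical helper; which tools to check
# depends on the recipe you want to build.
report_missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumption: bowtie2-build is the executable a bowtie2 index build needs;
# check the recipe for the real requirements.
report_missing_tools bowtie2-build
```

An empty result means every listed tool was found.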
Building assets with docker
If you don't want to install all the software needed to build all these assets (and I don't blame you), then you can just use docker. Each of our recipes knows about a docker image that has everything it needs. If you have docker installed, you should be able to simply run
refgenie build with the
-d flag. For example:
refgenie build -d genome/asset ...
This tells refgenie to execute the building in a docker container requested by the particular asset recipe you specify. Docker will automatically pull the image it needs when you call this. If you like, you can build the docker container yourself like this:
git clone https://github.com/databio/refgenie.git
cd refgenie/containers
make refgenie
or pull it directly from dockerhub like this:
docker pull databio/refgenie
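If the same script has to run on machines with and without docker, one option is to add the -d flag only when docker is found. This is a rough sketch: build_command is a hypothetical helper, and DOCKER_BIN is a made-up variable used only to make the lookup overridable. It prints the command rather than running it.

```shell
# Print a refgenie build command, adding -d only when docker is on PATH.
# build_command and DOCKER_BIN are hypothetical names used for illustration.
build_command() {
    registry_path=$1
    docker_bin=${DOCKER_BIN:-docker}
    if command -v "$docker_bin" >/dev/null 2>&1; then
        printf 'refgenie build -d %s\n' "$registry_path"
    else
        printf 'refgenie build %s\n' "$registry_path"
    fi
}

build_command genome/asset
```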
Versioning the assets
refgenie supports tags to facilitate management of multiple "versions" of the same asset. Simply add a
:your_tag_name suffix to the asset registry path in the
refgenie build command and the created asset will be tagged:
refgenie build hg38/bowtie2_index:my_tag
You can also learn more about tagging refgenie assets.
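To make the genome/asset:tag layout of a registry path concrete, here is a small sketch that splits one into its parts with POSIX parameter expansion. The split_registry_path helper is hypothetical and only illustrates the naming scheme; it does not query refgenie.

```shell
# Split a registry path of the form genome/asset[:tag] into its parts.
# split_registry_path is a hypothetical helper, not part of refgenie.
split_registry_path() {
    path=$1
    genome=${path%%/*}      # text before the first "/"
    rest=${path#*/}         # text after the first "/"
    asset=${rest%%:*}       # text before the first ":"
    case $rest in
        *:*) tag=${rest#*:} ;;   # text after the first ":"
        *)   tag= ;;             # no tag given
    esac
    printf 'genome=%s asset=%s tag=%s\n' "$genome" "$asset" "$tag"
}

split_registry_path hg38/bowtie2_index:my_tag
split_registry_path hg38/bowtie2_index
```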