Buildable assets

Refgenie can build a handful of assets for which we have already created building recipes. The refgenie list command shows all the assets refgenie can build:

$ refgenie list

Local recipes: bismark_bt1_index, bismark_bt2_index, bowtie2_index, bwa_index, dbnsfp, ensembl_gtf, ensembl_rb, epilog_index, fasta, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, star_index

If you want to add a new asset, you'll have to work with us to provide a script that can build it, which we can then incorporate into refgenie. If you have assets that cannot be scripted, or you want to add some other custom asset, you can add it manually and still have it managed by refgenie. We expect this will get much easier in the future.

Below, we go through the assets you can build and how to build them.

Top-level assets you can build

fasta

required input: --fasta example_genome.fa.gz
required asset: none
required software: samtools

We recommend that you build the fasta asset first for every genome, because it is the starting point for building many other assets.

Example fasta files:

wget http://big.databio.org/example_data/rCRS.fa.gz
refgenie build rCRS/fasta --fasta rCRS.fa.gz
refgenie seek rCRS/fasta
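
Because the fasta recipe lists samtools as required software, it can help to confirm the tool is on your PATH before starting a build. The following is a minimal sketch, not part of refgenie: require_tool is a hypothetical helper name introduced here for illustration.

```shell
# Sketch: confirm a required tool is on PATH before running a build.
# require_tool is a hypothetical helper, not a refgenie command.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required software: $1" >&2
    return 1
}

# Example use (commented out so the helper can be sourced on its own):
# require_tool samtools && refgenie build rCRS/fasta --fasta rCRS.fa.gz
```

The same check could be placed in front of any recipe below by swapping in the tool named on its "required software" line.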

refgene_anno

required input: --refgene refGene.txt.gz
required asset: none
required software: none

The refgene_anno asset is used to produce derived assets including transcription start sites (TSSs), exons, introns, and premature mRNA sequences.

Example refGene annotation files:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
refgenie build hg38/refgene_anno --refgene refGene.txt.gz

gencode_gtf

required input: --gencode_gtf gencode.gtf.gz
required asset: none
required software: none

The gencode_gtf asset contains all annotated transcripts.

Example gencode files:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz
refgenie build mm10/gencode_gtf --gencode_gtf gencode.vM23.annotation.gtf.gz

ensembl_gtf

required input: --ensembl_gtf ensembl.gtf.gz
required asset: none
required software: none

The ensembl_gtf asset is used to build other derived assets including a comprehensive TSS annotation and gene body annotation.

Example Ensembl files:

wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --ensembl_gtf Homo_sapiens.GRCh38.97.gtf.gz

ensembl_rb

required input: --gff regulatory_features.gff.gz
required asset: none
required software: none

The ensembl_rb asset is used to produce derived assets including feature annotations.

Example Ensembl files:

wget ftp://ftp.ensembl.org/pub/current_regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
refgenie build hg38/ensembl_rb --gff homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz

dbnsfp

required input: --dbnsfp dbNSFP4.0a.zip
required asset: none
required software: none

The dbnsfp asset is the annotation database for non-synonymous SNPs.

wget ftp://dbnsfp:[email protected]/dbNSFP4.0a.zip
refgenie build test/dbnsfp --dbnsfp dbNSFP4.0a.zip

Derived assets you can build

Many of the following derived assets require the corresponding software to build. You can either install each tool natively as you need it, or build the assets using Docker.

bowtie2_index

required input: none
required asset: fasta
required software: bowtie2

refgenie build test/bowtie2_index

bismark_bt1_index and bismark_bt2_index

required input: none
required asset: fasta
required software: bismark

refgenie build test/bismark_bt1_index
refgenie build test/bismark_bt2_index

bwa_index

required input: none
required asset: fasta
required software: bwa

refgenie build test/bwa_index

hisat2_index

required input: none
required asset: fasta
required software: hisat2

refgenie build test/hisat2_index

epilog_index

required input: --context (e.g. CG)
required asset: fasta
required software: epilog

refgenie build test/epilog_index --context CG

kallisto_index

required input: none
required asset: fasta
required software: kallisto

refgenie build test/kallisto_index

salmon_index

required input: none
required asset: fasta
required software: salmon

refgenie build test/salmon_index

star_index

required input: none
required asset: fasta
required software: star

refgenie build test/star_index
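
The index recipes above that take no extra input all follow the same pattern, so they can be built in a loop once the fasta asset exists. This is a rough sketch under assumptions: build_indexes and the DRY_RUN variable are hypothetical names introduced here, not refgenie features. epilog_index is left out because it also needs --context.

```shell
# Sketch: print (or run) one build command per fasta-derived index.
# build_indexes and DRY_RUN are hypothetical names used for illustration.
build_indexes() {
    genome="$1"
    shift
    for asset in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be run.
            echo "refgenie build ${genome}/${asset}"
        else
            # Stop at the first failed build.
            refgenie build "${genome}/${asset}" || return 1
        fi
    done
}

# Example use (commented out so the function can be sourced on its own):
# DRY_RUN=1 build_indexes test bowtie2_index bwa_index hisat2_index
```

Running with DRY_RUN=1 first lets you review the commands before committing to builds that may take a long time.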

feat_annotation

required input: none
required asset: ensembl_gtf, ensembl_rb
required software: none

The feat_annotation asset includes the following genomic feature annotations: enhancers, promoters, promoter flanking regions, 5' UTR, 3' UTR, exons, and introns.

refgenie build test/feat_annotation
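
Since feat_annotation depends on both ensembl_gtf and ensembl_rb, you may want to check that those assets are present before building it. The sketch below assumes that refgenie seek exits with a non-zero status when an asset is unavailable; have_assets and the REFGENIE variable are hypothetical names introduced here.

```shell
# Sketch: verify prerequisite assets exist before building a derived asset.
# REFGENIE and have_assets are hypothetical names; the check assumes
# `refgenie seek` exits non-zero when an asset is unavailable.
REFGENIE="${REFGENIE:-refgenie}"

have_assets() {
    genome="$1"
    shift
    for asset in "$@"; do
        if ! "$REFGENIE" seek "${genome}/${asset}" >/dev/null 2>&1; then
            echo "missing prerequisite: ${genome}/${asset}" >&2
            return 1
        fi
    done
    return 0
}

# Example use (commented out so the function can be sourced on its own):
# have_assets test ensembl_gtf ensembl_rb && refgenie build test/feat_annotation
```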