Buildable assets

Refgenie can build a handful of assets for which we have already created building recipes. The refgenie list command shows all assets that refgenie can build:

$ refgenie list

Local recipes: bismark_bt1_index, bismark_bt2_index, bowtie2_index, bwa_index, dbnsfp, ensembl_gtf, ensembl_rb, epilog_index, fasta, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, star_index, suffixerator_index, tallymer_index
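If you want to check for a recipe from a script, the "Local recipes:" line can be parsed. The following is a minimal sketch, assuming the output keeps the single-line, comma-separated format shown above; has_recipe is a hypothetical helper name, not part of refgenie.

```shell
# Hypothetical helper: succeed if recipe name $1 appears in the
# "Local recipes:" line read from stdin (format assumed from the output above).
has_recipe() {
    sed -n 's/^Local recipes: *//p' \
        | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//' \
        | grep -qxF -- "$1"
}

# Sample line for illustration; in practice pipe the real output instead:
#   refgenie list | has_recipe bwa_index
sample='Local recipes: bowtie2_index, bwa_index, fasta'
if printf '%s\n' "$sample" | has_recipe bwa_index; then
    echo "bwa_index can be built"
fi
```

The match is on the whole recipe name, so a partial name such as bwa does not match bwa_index.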

To add a new buildable asset, work with us to provide a script that can build it, and we will incorporate it into refgenie. If your asset cannot be scripted, or you want to add some other custom asset, you can add it manually and still have it managed by refgenie. We expect this to get much easier in the future.

Below, we go through the assets you can build and how to build them.

Top-level assets you can build

fasta

required files: --files fasta=/path/to/fasta_file (e.g. example_genome.fa.gz)
required parameters: none
required asset: none
required software: samtools

We recommend that you build the fasta asset first for every genome, because it is the starting point for building many other assets.

Example fasta files:

wget http://big.databio.org/example_data/rCRS.fa.gz
refgenie build rCRS/fasta --files fasta=rCRS.fa.gz
refgenie seek rCRS/fasta

blacklist

required files: --files blacklist=/path/to/blacklist_file (e.g. hg38-blacklist.v2.bed.gz)
required parameters: none
required asset: none
required software: none

The blacklist asset represents regions that should be excluded from sequencing experiments. The ENCODE blacklist represents a comprehensive listing of these regions for several model organisms [^Amemiya2019].

Example blacklist files:

wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
refgenie build hg38/blacklist --files blacklist=hg38-blacklist.v2.bed.gz
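Before building, it can help to confirm that the download really is a gzip-compressed file, since a web URL that returns an HTML page still saves under the expected file name. A minimal sketch, assuming gzip is installed; is_valid_gzip is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the file exists and passes
# gzip's built-in integrity test.
is_valid_gzip() {
    [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}

f=hg38-blacklist.v2.bed.gz
if is_valid_gzip "$f"; then
    echo "$f looks like a valid gzip file"
else
    echo "$f is missing or not gzip-compressed; check the download URL" >&2
fi
```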

refgene_anno

required files: --files refgene=/path/to/refGene_file (e.g. refGene.txt.gz)
required parameters: none
required asset: none
required software: none

The refgene_anno asset is used to produce derived assets including transcription start sites (TSSs), exons, introns, and premature mRNA sequences.

Example refGene annotation files:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
refgenie build hg38/refgene_anno --files refgene=refGene.txt.gz

gencode_gtf

required files: --files gencode_gtf=/path/to/gencode_file (e.g. gencode.gtf.gz)
required parameters: none
required asset: none
required software: none

The gencode_gtf asset contains all annotated transcripts.

Example gencode files:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz
refgenie build mm10/gencode_gtf --files gencode_gtf=gencode.vM23.annotation.gtf.gz

ensembl_gtf

required files: --files ensembl_gtf=/path/to/ensembl_file (e.g. ensembl.gtf.gz)
required parameters: none
required asset: none
required software: none

The ensembl_gtf asset is used to build other derived assets including a comprehensive TSS annotation and gene body annotation.

Example Ensembl files:

wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --files ensembl_gtf=Homo_sapiens.GRCh38.97.gtf.gz

ensembl_rb

required files: --files gff=/path/to/gff_file (e.g. regulatory_features.gff.gz)
required parameters: none
required asset: none
required software: none

The ensembl_rb asset is used to produce derived assets including feature annotations.

Example Ensembl files:

wget ftp://ftp.ensembl.org/pub/current_regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
refgenie build hg38/ensembl_rb --files gff=homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz

dbnsfp

required files: --files dbnsfp=/path/to/dbnsfp_file (e.g. dbNSFP4.0a.zip)
required parameters: none
required asset: none
required software: none

The dbnsfp asset is the annotation database for non-synonymous SNPs.

wget ftp://dbnsfp:[email protected]/dbNSFP4.0a.zip
refgenie build test/dbnsfp --files dbnsfp=dbNSFP4.0a.zip

Derived assets you can build

Many of the following derived assets require the corresponding software to be available at build time. You can either install each tool natively as you need it, or build the assets using Docker.
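If you install tools natively, a quick check of your PATH before starting a long build can save time. The following is a minimal sketch; missing_tools is a hypothetical helper, and the executable names in the example call are assumptions about what the recipes invoke, so adjust them to the assets you plan to build.

```shell
# Hypothetical helper: print each named program that is NOT found on PATH,
# one per line. Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Executable names below are assumptions, not taken from the recipes.
missing=$(missing_tools samtools bowtie2-build bwa hisat2-build)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All listed tools found."
fi
```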

bowtie2_index

required files: none
required parameters: none
required asset: fasta
required software: bowtie2

refgenie build test/bowtie2_index

bismark_bt1_index and bismark_bt2_index

required files: none
required parameters: none
required asset: fasta
required software: bismark

refgenie build test/bismark_bt1_index
refgenie build test/bismark_bt2_index

bwa_index

required files: none
required parameters: none
required asset: fasta
required software: bwa

refgenie build test/bwa_index

hisat2_index

required files: none
required parameters: none
required asset: fasta
required software: hisat2

refgenie build test/hisat2_index

epilog_index

required files: none
required parameters: --params context=CG (Default)
required asset: fasta
required software: epilog

refgenie build test/epilog_index --params context=CG

kallisto_index

required files: none
required parameters: none
required asset: fasta
required software: kallisto

refgenie build test/kallisto_index

salmon_index

required files: none
required parameters: none
required asset: fasta
required software: salmon

refgenie build test/salmon_index

star_index

required files: none
required parameters: none
required asset: fasta
required software: star

refgenie build test/star_index
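Since the index assets above all follow the same refgenie build GENOME/ASSET pattern and need only the fasta asset, several can be built in one loop. This is a minimal sketch, not a refgenie feature: build_assets and the DRY_RUN variable are hypothetical names introduced here. By default it only prints the commands it would run.

```shell
# Hypothetical helper: print (default) or run `refgenie build` for each
# asset of one genome. Set DRY_RUN=0 to actually run the builds.
build_assets() {
    genome=$1
    shift
    for asset in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "refgenie build $genome/$asset"
        else
            # Stop at the first failed build.
            refgenie build "$genome/$asset" || return 1
        fi
    done
}

# Assumes the fasta asset for the "test" genome has already been built.
build_assets test bowtie2_index bwa_index hisat2_index star_index
```

Stopping at the first failure keeps a missing tool from producing a long run of repeated errors.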

suffixerator_index

required files: none
required parameters: --params memlimit=8GB (Default)
required asset: fasta
required software: GenomeTools

refgenie build test/suffixerator_index --params memlimit=8GB

tallymer_index

required files: none
required parameters: --params mersize=30 minocc=2 (Default)
required asset: fasta
required software: GenomeTools

refgenie build test/tallymer_index --params mersize=30 minocc=2

feat_annotation

required files: none
required parameters: none
required asset: ensembl_gtf, ensembl_rb
required software: none

The feat_annotation asset includes the following genomic feature annotations: enhancers, promoters, promoter flanking regions, 5' UTR, 3' UTR, exons, and introns.

refgenie build test/feat_annotation

[^Amemiya2019]: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 2019;9, 9354. doi:10.1038/s41598-019-45839-z