Tutorial

I assume you've already installed refgenie. In this tutorial I'll show you a few ways to use refgenie from the command line (shell commands are prefixed with a !) and from Python.

To start, initialize an empty refgenie configuration file from the shell:

!refgenie init -c refgenie.yaml
Initialized genome configuration file: /home/nsheff/code/refgenie/docs_jupyter/refgenie.yaml

Here's what it looks like:

!cat refgenie.yaml
config_version: 0.3
genome_folder: /home/nsheff/code/refgenie/docs_jupyter
genome_servers: ['http://refgenomes.databio.org']
genomes: null

Now let's switch to Python and load the configuration file with refgenconf:

import refgenconf
rgc = refgenconf.RefGenConf("refgenie.yaml")

Use pull to download an asset from the server (here, the default fasta asset for hs38d1):

rgc.pull("hs38d1", "fasta", "default")

(['hs38d1', 'fasta', 'default'],
 {'archive_digest': '310c578812a64fcdf08d2df60d7b79b4',
  'archive_size': '1.7MB',
  'asset_children': ['hs38d1/star_index:default',
   'hs38d1/bwa_index:default',
   'hs38d1/bowtie2_index:default',
   'hs38d1/bismark_bt1_index:default',
   'hs38d1/bismark_bt2_index:default',
   'hs38d1/hisat2_index:default',
   'hs38d1/tallymer_index:default',
   'hs38d1/suffixerator_index:default'],
  'asset_digest': 'eddf5466faa3391a7114e87648466dcb',
  'asset_parents': [],
  'asset_path': 'fasta',
  'asset_size': '6.0MB',
  'seek_keys': {'chrom_sizes': 'hs38d1.chrom.sizes',
   'fai': 'hs38d1.fa.fai',
   'fasta': 'hs38d1.fa'}},
 'http://refgenomes.databio.org')
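
The value returned by pull above looks like a three-part tuple: the asset's [genome, asset, tag], a dictionary of asset attributes, and the server it came from. Here's a minimal sketch of unpacking it. The data is a trimmed copy of the output above, and summarize_pull is a hypothetical helper of mine, not part of refgenconf; the tuple layout is an assumption based only on what this one call returned.

```python
# Trimmed copy of the value displayed above; only a few fields are kept.
pull_result = (
    ["hs38d1", "fasta", "default"],
    {
        "asset_digest": "eddf5466faa3391a7114e87648466dcb",
        "asset_path": "fasta",
        "seek_keys": {
            "chrom_sizes": "hs38d1.chrom.sizes",
            "fai": "hs38d1.fa.fai",
            "fasta": "hs38d1.fa",
        },
    },
    "http://refgenomes.databio.org",
)


def summarize_pull(result):
    """Hypothetical helper: flatten a pull result into a small summary dict."""
    (genome, asset, tag), attributes, server = result
    return {
        "registry_path": "{}/{}:{}".format(genome, asset, tag),
        "digest": attributes["asset_digest"],
        "seek_keys": sorted(attributes["seek_keys"]),
        "server": server,
    }


summary = summarize_pull(pull_result)
print(summary["registry_path"])
print(summary["seek_keys"])
```

The seek_keys entries name the individual files inside the asset, which is what seek resolves to a full path.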

Once it's downloaded, use seek to retrieve the local path to it:

rgc.seek("hs38d1", "fasta")
'/home/nsheff/code/refgenie/docs_jupyter/hs38d1/fasta/default/hs38d1.fa'
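
A path from seek is just a string, so you can hand it to any ordinary file-handling code. As a rough sketch, here is one way to read a chrom_sizes file (one of the seek keys listed in the pull output) into a dictionary. read_chrom_sizes is a hypothetical helper, and the demo file with its names and lengths is made up so the snippet runs without refgenie. In real use the path would come from seek; I believe seek accepts a seek_key argument for this, but check your installed refgenconf version.

```python
import os
import tempfile


def read_chrom_sizes(path):
    """Hypothetical helper: parse a two-column, tab-separated chrom.sizes file."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            name, length = line.split("\t")[:2]
            sizes[name] = int(length)
    return sizes


# Stand-in file so the sketch runs on its own; the contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.chrom.sizes")
    with open(demo_path, "w") as handle:
        handle.write("chrA\t1200\nchrB\t340\n")
    demo_sizes = read_chrom_sizes(demo_path)

print(demo_sizes)
```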

You can get the unique asset identifier (its digest) with id():

rgc.id("hs38d1", "fasta")
'eddf5466faa3391a7114e87648466dcb'

Building and pulling from the command line

Instead of pulling a fasta asset, we can build one. Back in the shell, we'll grab the Revised Cambridge Reference Sequence (the human mitochondrial genome, chosen because it's small):

!wget http://big.databio.org/refgenie_raw/rCRSd.fa.gz
--2020-03-13 16:11:59--  http://big.databio.org/refgenie_raw/rCRSd.fa.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.181
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8399 (8.2K) [application/octet-stream]
Saving to: ‘rCRSd.fa.gz’

rCRSd.fa.gz         100%[===================>]   8.20K  --.-KB/s    in 0s      

2020-03-13 16:11:59 (214 MB/s) - ‘rCRSd.fa.gz’ saved [8399/8399]


!refgenie build rCRSd/fasta -c refgenie.yaml  --files fasta=rCRSd.fa.gz -R
Using 'default' as the default tag for 'rCRSd/fasta'
Building 'rCRSd/fasta:default' using 'fasta' recipe
Saving outputs to:
- content: /home/nsheff/code/refgenie/docs_jupyter/rCRSd
- logs: /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/_refgenie_build
### Pipeline run code and environment:

*              Command:  `/home/nsheff/.local/bin/refgenie build rCRSd/fasta -c refgenie.yaml --files fasta=rCRSd.fa.gz -R`
*         Compute host:  puma
*          Working dir:  /home/nsheff/code/refgenie/docs_jupyter
*            Outfolder:  /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/_refgenie_build/
*  Pipeline started at:   (03-13 16:11:59) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  3.7.6
*          Pypiper dir:  `/home/nsheff/.local/lib/python3.7/site-packages/pypiper`
*      Pypiper version:  0.12.1
*         Pipeline dir:  `/home/nsheff/.local/bin`
*     Pipeline version:  None

### Arguments passed to pipeline:

* `asset_registry_paths`:  `['rCRSd/fasta']`
*             `assets`:  `None`
*            `command`:  `build`
*        `config_file`:  `refgenie.yaml`
*             `docker`:  `False`
*              `files`:  `[['fasta=rCRSd.fa.gz']]`
*             `genome`:  `None`
*      `genome_config`:  `refgenie.yaml`
* `genome_description`:  `None`
*             `logdev`:  `False`
*          `new_start`:  `False`
*          `outfolder`:  `/home/nsheff/code/refgenie/docs_jupyter`
*             `params`:  `None`
*             `recipe`:  `None`
*            `recover`:  `True`
*       `requirements`:  `False`
*             `silent`:  `False`
*    `tag_description`:  `None`
*          `verbosity`:  `None`
*            `volumes`:  `None`

----------------------------------------

Target to produce: `/home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/_refgenie_build/rCRSd_fasta__default.flag`  

> `cp rCRSd.fa.gz /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa.gz` (28689)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0GB.  
  PID: 28689;   Command: cp;    Return code: 0; Memory used: 0.0GB


> `gzip -d /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa.gz` (28691)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0GB.  
  PID: 28691;   Command: gzip;  Return code: 0; Memory used: 0.0GB


> `samtools faidx /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa` (28693)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.018GB.  
  PID: 28693;   Command: samtools;  Return code: 0; Memory used: 0.018GB


> `cut -f 1,2 /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa.fai > /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.chrom.sizes` (28761)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.018GB.  
  PID: 28761;   Command: cut;   Return code: 0; Memory used: 0.0GB


> `touch /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/_refgenie_build/rCRSd_fasta__default.flag` (28763)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.018GB.  
  PID: 28763;   Command: touch; Return code: 0; Memory used: 0.0GB


> `cd /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default; find . -type f -not -path './_refgenie_build*' -exec md5sum {} \; | sort -k 2 | awk '{print $1}' | md5sum`
Asset digest: 4eb430296bc02ed7e4006624f1d5ac53
Default tag for 'rCRSd/fasta' set to: default

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:00:01
*  Total elapsed time (all runs):  0:00:01
*         Peak memory (this run):  0.0184 GB
*        Pipeline completed time: 2020-03-13 16:12:00
Computing initial genome digest...
Initializing genome...
Finished building 'fasta' asset
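
The build log above ends with a shell pipeline that computes the asset digest: md5 every file outside _refgenie_build, sort by path, then md5 the list of per-file digests. Here's a rough Python sketch of the same idea, in case you want to see what goes into that number. directory_digest and md5_of_file are hypothetical helpers of mine, not refgenie functions, and the result may not match refgenie's digest byte for byte (sort order in the shell depends on locale, and find -type f skips symlinks while this sketch does not).

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_digest(root, exclude_prefix="./_refgenie_build"):
    """Hypothetical helper: md5 over the sorted per-file md5s under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = "./" + os.path.relpath(full_path, root).replace(os.sep, "/")
            if rel_path.startswith(exclude_prefix):
                continue  # mirrors -not -path './_refgenie_build*'
            entries.append((rel_path, md5_of_file(full_path)))
    entries.sort()  # sort by relative path, like `sort -k 2`
    combined = "".join(file_digest + "\n" for _path, file_digest in entries)
    return hashlib.md5(combined.encode("ascii")).hexdigest()


# Demo on a throwaway directory with invented contents.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "demo.fa"), "w") as handle:
        handle.write(">demo\nACGT\n")
    os.makedirs(os.path.join(root, "_refgenie_build"))
    with open(os.path.join(root, "_refgenie_build", "log.md"), "w") as handle:
        handle.write("first log\n")
    digest_before = directory_digest(root)

    # Changing the build log should not affect the digest.
    with open(os.path.join(root, "_refgenie_build", "log.md"), "a") as handle:
        handle.write("more log\n")
    digest_after_log = directory_digest(root)

    # Changing the asset content should.
    with open(os.path.join(root, "demo.fa"), "a") as handle:
        handle.write("TTTT\n")
    digest_after_edit = directory_digest(root)

print(digest_before == digest_after_log)
print(digest_before == digest_after_edit)
```

The point of excluding _refgenie_build is that logs and flag files differ between runs, while the digest should depend only on the asset's content.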

!refgenie seek rCRSd/fasta -c refgenie.yaml
/home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa

You can do the same thing from within Python:

rgc = refgenconf.RefGenConf("refgenie.yaml")
rgc.seek("rCRSd", "fasta")
'/home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa'

Now, if you have bowtie2-build in your PATH, you can build the bowtie2 index with no further requirements.
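
If you're not sure whether bowtie2-build is on your PATH, a quick check from Python is shutil.which from the standard library. This is just a small sketch; missing_tools is a hypothetical helper name, and refgenie does not need you to run it.

```python
import shutil


def missing_tools(tool_names):
    """Return the names from tool_names that are not found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


missing = missing_tools(["bowtie2-build"])
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("bowtie2-build is available")
```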

You can see the requirements with --requirements:

!refgenie build rCRSd/bowtie2_index -c refgenie.yaml --requirements
'bowtie2_index' recipe requirements: 
- assets:
    fasta (fasta asset for genome); default: fasta

Since I already have the fasta asset, I don't need anything else to build the bowtie2_index:

!refgenie build rCRSd/bowtie2_index -c refgenie.yaml
Using 'default' as the default tag for 'rCRSd/bowtie2_index'
Building 'rCRSd/bowtie2_index:default' using 'bowtie2_index' recipe
Saving outputs to:
- content: /home/nsheff/code/refgenie/docs_jupyter/rCRSd
- logs: /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/_refgenie_build
### Pipeline run code and environment:

*              Command:  `/home/nsheff/.local/bin/refgenie build rCRSd/bowtie2_index -c refgenie.yaml`
*         Compute host:  puma
*          Working dir:  /home/nsheff/code/refgenie/docs_jupyter
*            Outfolder:  /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/_refgenie_build/
*  Pipeline started at:   (03-13 16:12:02) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  3.7.6
*          Pypiper dir:  `/home/nsheff/.local/lib/python3.7/site-packages/pypiper`
*      Pypiper version:  0.12.1
*         Pipeline dir:  `/home/nsheff/.local/bin`
*     Pipeline version:  None

### Arguments passed to pipeline:

* `asset_registry_paths`:  `['rCRSd/bowtie2_index']`
*             `assets`:  `None`
*            `command`:  `build`
*        `config_file`:  `refgenie.yaml`
*             `docker`:  `False`
*              `files`:  `None`
*             `genome`:  `None`
*      `genome_config`:  `refgenie.yaml`
* `genome_description`:  `None`
*             `logdev`:  `False`
*          `new_start`:  `False`
*          `outfolder`:  `/home/nsheff/code/refgenie/docs_jupyter`
*             `params`:  `None`
*             `recipe`:  `None`
*            `recover`:  `False`
*       `requirements`:  `False`
*             `silent`:  `False`
*    `tag_description`:  `None`
*          `verbosity`:  `None`
*            `volumes`:  `None`

----------------------------------------

Target to produce: `/home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/_refgenie_build/rCRSd_bowtie2_index__default.flag`  

> `bowtie2-build /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/rCRSd` (28812)
<pre>
Building a SMALL index
Settings:
  Output files: "/home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/rCRSd.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /home/nsheff/code/refgenie/docs_jupyter/rCRSd/fasta/default/rCRSd.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 8284
Using parameters --bmax 6213 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 6213 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 33136 (target: 6212)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 33136 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 33137 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 10248
fchr[G]: 20610
fchr[T]: 24948
fchr[$]: 33136
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4205567 bytes to primary EBWT file: /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/rCRSd.1.bt2
Wrote 8292 bytes to secondary EBWT file: /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/rCRSd.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 33136
    bwtLen: 33137
    sz: 8284
    bwtSz: 8285
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 2072
    offsSz: 8288
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 173
    numLines: 173
    ebwtTotLen: 11072
    ebwtTotSz: 11072
    color: 0
    reverse: 0
Total time for call to driver() for forward index: 00:00:00
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to reverse reference sequence: 00:00:00
bmax according to bmaxDivN setting: 8284
Using parameters --bmax 6213 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 6213 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
  Building sPrime
  Building sPrimeOrder
  V-Sorting samples
  V-Sorting samples time: 00:00:00
  Allocating rank array
  Ranking v-sort output
  Ranking v-sort output time: 00:00:00
  Invoking Larsson-Sadakane on ranks
  Invoking Larsson-Sadakane on ranks time: 00:00:00
  Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
  (Using difference cover)
  Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
  Splitting and merging time: 00:00:00
Avg bucket size: 33136 (target: 6212)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 1
  No samples; assembling all-inclusive block
  Sorting block of length 33136 for bucket 1
  (Using difference cover)
  Sorting block time: 00:00:00
Returning block of 33137 for bucket 1
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 10248
fchr[G]: 20610
fchr[T]: 24948
fchr[$]: 33136
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4205567 bytes to primary EBWT file: /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/rCRSd.rev.1.bt2
Wrote 8292 bytes to secondary EBWT file: /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/rCRSd.rev.2.bt2
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
    len: 33136
    bwtLen: 33137
    sz: 8284
    bwtSz: 8285
    lineRate: 6
    offRate: 4
    offMask: 0xfffffff0
    ftabChars: 10
    eftabLen: 20
    eftabSz: 80
    ftabLen: 1048577
    ftabSz: 4194308
    offsLen: 2072
    offsSz: 8288
    lineSz: 64
    sideSz: 64
    sideBwtSz: 48
    sideBwtLen: 192
    numSides: 173
    numLines: 173
    ebwtTotLen: 11072
    ebwtTotSz: 11072
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:00
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.019GB.  
  PID: 28812;   Command: bowtie2-build; Return code: 0; Memory used: 0.019GB


> `touch /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default/_refgenie_build/rCRSd_bowtie2_index__default.flag` (28879)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.019GB.  
  PID: 28879;   Command: touch; Return code: 0; Memory used: 0.0GB


> `cd /home/nsheff/code/refgenie/docs_jupyter/rCRSd/bowtie2_index/default; find . -type f -not -path './_refgenie_build*' -exec md5sum {} \; | sort -k 2 | awk '{print $1}' | md5sum`
Asset digest: 1262e30d4a87db9365d501de8559b3b4
Default tag for 'rCRSd/bowtie2_index' set to: default

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:00:01
*  Total elapsed time (all runs):  0:00:01
*         Peak memory (this run):  0.0188 GB
*        Pipeline completed time: 2020-03-13 16:12:03
Finished building 'bowtie2_index' asset

You can list the local genomes, the available recipes, and the local assets like this:

!refgenie list -c refgenie.yaml
Server subscriptions: http://refgenomes.databio.org
Local genomes: hs38d1, rCRSd
Local recipes: bismark_bt1_index, bismark_bt2_index, blacklist, bowtie2_index, bwa_index, cellranger_reference, dbnsfp, dbsnp, ensembl_gtf, ensembl_rb, epilog_index, fasta, fasta_txome, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, salmon_partial_sa_index, salmon_sa_index, star_index, suffixerator_index, tallymer_index
Local assets:
              hs38d1/   fasta.chrom_sizes:default, fasta.fai:default, fasta:default
               rCRSd/   bowtie2_index:default, fasta.chrom_sizes:default, fasta.fai:default, fasta:default

You can get the unique digest for any asset with refgenie id:

!refgenie id rCRSd/fasta -c refgenie.yaml
rCRSd/fasta:default,4eb430296bc02ed7e4006624f1d5ac53
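
The line printed above has the form genome/asset:tag,digest, which is easy to split apart if you capture it in a script. Here's a minimal sketch; parse_id_line is a hypothetical helper, and it assumes the output always follows the shape shown in this one example.

```python
def parse_id_line(line):
    """Hypothetical helper: split 'genome/asset:tag,digest' into its parts."""
    registry_path, digest = line.strip().rsplit(",", 1)
    genome, remainder = registry_path.split("/", 1)
    asset, _, tag = remainder.partition(":")
    return {
        "genome": genome,
        "asset": asset,
        "tag": tag or None,  # None when no tag is present
        "digest": digest,
    }


parsed = parse_id_line("rCRSd/fasta:default,4eb430296bc02ed7e4006624f1d5ac53")
print(parsed)
```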

Versions

from platform import python_version 
python_version()
'3.5.2'
!refgenie --version
refgenie 0.9.0-dev

refgenconf.__version__
'0.7.0-dev'