Genome aliases

TLDR; The genome alias system in refgenie allows users to refer to assets with arbitrary strings managed with refgenie alias command.

Motivation

Many systems rely on human-readable identifiers of genomes, such as "hg38". However, two users may refer to different things with the same identifier, such as the many slight variations of the hg38 genome assembly. Such identifier mismatches lead to compatibility issues that incur the wrath of bioinformaticians everywhere. A step toward solving this problem is to use unique identifiers that unambiguously identify a particular assembly, such as those provided by the NCBI Assembly database; however, this approach relies on a central authority, and therefore can not apply to custom genomes or custom assets. Besides, human-readable identifiers persist because there's something simple and satisfying about referring to a genome or piece of data with a simple string that makes some sense and is easy to remember, like hg38.

Solutions

Sequence-derived identifiers

Refgenie’s approach extends the refget algorithm by GA4GH, introduced in 2019 to collections of annotated sequences. This means that the unique sequence-derived genome identifier calculated by refgenie captures not only sequence content, but also related metadata like sequence names and length. So, instead of referring to human genome as, e.g. "hg38" refgenie unambiguously identifies it as 58de7f33a36ccd9d6e3b1b3afe6b9f37cd5b2867bbfb929a.

Genome namespace initialization

The genome digest is calculated based on a FASTA file once the genome namespace is first created. This can happen when the fasta asset is pulled or built.

To start, initialize an empty refgenie configuration file from the shell and subscribe to the desired asset server:

export REFGENIE=$(pwd)/refgenie.yaml
refgenie init -c $REFGENIE -s http://rg.databio.org
Initialized genome configuration file: /Users/mstolarczyk/code/refgenie/docs_jupyter/refgenie.yaml
Created directories:
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/data
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias

Now, let's pull a fasta asset, which is one way to initialize a genome:

refgenie pull rCRSd/fasta --force
Compatible refgenieserver instances: ['http://rg.databio.org']
No local digest for genome alias: rCRSd
Setting 'rCRSd' identity with server: http://rg.databio.org/v3/genomes/genome_digest/rCRSd
Determined server digest for local genome alias (rCRSd): 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4
Set genome alias (94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4: rCRSd)
Created alias directories: 
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd
Downloading URL: http://rg.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta
94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta:default  100.0% •  •  • … •  • ? • … …
[?25hDownload complete: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta/fasta__default.tgz
Extracting asset tarball: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta/fasta__default.tgz
Default tag for '94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta' set to: default
Initializing genome: rCRSd
Loaded AnnotatedSequenceDigestList (1 sequences)
Set genome alias (94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4: rCRSd)
Created alias directories: 
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd/fasta/default

Following the refgenie pull command logs we notice that multiple steps happened: 1. refgenie used the human-readable genome name from the refgenie pull call (rCRSd) to query the server for any digest associated with it 2. refgenie set the digest it got back from the server as the genome identifier and set the human-readable genome name as an alias 3. refgenie used the genome idenfitier (not the user-specified name) to query the server for the fasta asset

From now on, the unique sequence-derived genome identifier will be used to query asset servers

refgenie pull rCRSd/bowtie2_index --force
Compatible refgenieserver instances: ['http://rg.databio.org']
Downloading URL: http://rg.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index
94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index:defau…  100.…   • •
[?25hDownload complete: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/bowtie2_index__default.tgz
Extracting asset tarball: /Users/mstolarczyk/code/refgenie/docs_jupyter/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/bowtie2_index__default.tgz
Default tag for '94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index' set to: default
Created alias directories: 
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd/bowtie2_index/default

Genome aliases

To make the user's life easier genome aliases system in refgenie allows to set arbitrary genome aliases that can be then used to refer to a genome. Users can interact with genome aliases using refgenie alias command:

refgenie alias --help
usage: refgenie alias [-h] {remove,set,get} ...

Interact with aliases.

positional arguments:
  {remove,set,get}
    remove          Remove aliases.
    set             Set aliases.
    get             Get aliases.

optional arguments:
  -h, --help        show this help message and exit

Set aliases

To set an alias "mito" for genome identified by digest 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 one needs to issue the command below:

refgenie alias set --aliases mito --digest 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4
Set genome alias (94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4: mito)
Created alias directories: 
 - /Users/mstolarczyk/code/refgenie/docs_jupyter/alias/mito

Get aliases

To get see the entire aliases collection managed by refgenie one needs to issue the command below:

refgenie alias get
                          Genome aliases                          
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ genome                                           ┃ alias       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 │ rCRSd, mito │
└──────────────────────────────────────────────────┴─────────────┘

alias and data directories

Refgenie stores asset data in two directories: alias and data. The data directory consists of the actual asset files, which are built or pulled from asset servers. The files in this directory are named using the digests, which helps refgenie to unambigously identify genomes. The alias holds symbolic links to asset data in data directory. This way users do not need to be aware of the digest-named files at all and there is no waste of disk space due to symbolic links.

Here's a general view of the contents of both directories:

tree alias -d 
tree data -d 
alias
├── mito
│   ├── bowtie2_index
│   │   └── default
│   └── fasta
│       └── default
└── rCRSd
    ├── bowtie2_index
    │   └── default
    └── fasta
        └── default

10 directories
data
└── 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4
    ├── bowtie2_index
    │   └── default
    └── fasta
        └── default

5 directories

As you can see, the alias directory holds both of the defined aliases. Let's take a closer look at one of them

tree alias/rCRSd/fasta
alias/rCRSd/fasta
└── default
    ├── rCRSd.chrom.sizes -> ../../../../data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta/default/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4.chrom.sizes
    ├── rCRSd.fa -> ../../../../data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta/default/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4.fa
    └── rCRSd.fa.fai -> ../../../../data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta/default/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4.fa.fai

1 directory, 3 files

This explicitly shows that the files inside alias/rCRSd/fasta/default are in fact symbolic links that point to the actual asset files in data directory.