Downloading germline sets from the command line or API

In this post, we’ll look briefly at the different ways you can download germline sets from the command line or from code. We want germline sets on OGRDB to be as easy to work with as possible, and convenient downloads are an important part of that: they allow websites, tools, and other applications to keep their copies up to date.

If you hover over a download button on a page listing germline sets, you will see a link similar to this one:

https://ogrdb.airr-community.org/download_germline_set/Human/IGH_VDJ/published/gapped_ex

You can use these links at the command line:

>curl https://ogrdb.airr-community.org/download_germline_set/Human/IGH_VDJ/published/gapped_ex
IGHD1-1*01
GGTACAACTGGAACGAC

We’ll explore the anatomy of this link shortly, but first let’s look at one other option, and why you might want to use it. The link above is really intended for use on the OGRDB website, and won’t give an easy-to-process error if something goes wrong. For that reason, it may be better to use a link from the OGRDB REST API, which has exactly the same form but a slightly different prefix:

>curl https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/published/gapped_ex
IGHD1-1*01
GGTACAACTGGAACGAC

For example, if we request a nonexistent germline set through the API, the response code will be 404 and the response body will contain an error message that is easy to process:

>curl https://ogrdb.airr-community.org/api/germline/set/Human/IGH_XXX/published/gapped_ex
{
"error": "Set not found"
}
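In a script, it can be useful to check the status code before using the response body. The following is a minimal sketch of one way to do that with curl. The function name fetch_germline_set and the output file name are our own illustrative choices, not part of OGRDB, and the sketch assumes only what is described above: a successful request returns the set, and anything else returns an error message in the body.

```shell
#!/bin/sh
# Hypothetical helper: download a germline set from the OGRDB REST API,
# keeping the file only if the server answers with HTTP 200.
# $1 = URL of the germline set, $2 = output file
fetch_germline_set() {
    url=$1
    out=$2
    tmp=$(mktemp) || return 1
    # -s: no progress meter, -o: write the body to a file,
    # -w '%{http_code}': print the HTTP status code on stdout
    status=$(curl -s -o "$tmp" -w '%{http_code}' "$url")
    if [ "$status" = "200" ]; then
        mv "$tmp" "$out"
    else
        # Report the status and the error body, then discard the download
        echo "Download failed (HTTP $status):" >&2
        cat "$tmp" >&2
        rm -f "$tmp"
        return 1
    fi
}
```

It could then be called as follows, with the file name chosen to suit:

>fetch_germline_set https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/published/gapped_ex human_igh_vdj.fasta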

The REST API has calls that you can use to list the species for which OGRDB has sets available, the sets themselves, the versions available, and so on. You can browse the API, and experiment with it, by following the link. However, if you just want a link to a particular set, it may be easier to take the link from the download button on the page listing germline sets, and swap the segment download_germline_set/ for api/germline/set/.
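If you are doing this in a script, the swap can be automated. Here is a small sketch using sed; the variable names are just for illustration.

```shell
# The link copied from the download button
download_url="https://ogrdb.airr-community.org/download_germline_set/Human/IGH_VDJ/published/gapped_ex"

# Swap the website path segment for the REST API one
api_url=$(printf '%s\n' "$download_url" | sed 's|/download_germline_set/|/api/germline/set/|')

echo "$api_url"
```

This prints the same API link that we used earlier.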

The germline set links have five components:

species – the species as used within OGRDB. OGRDB uses informal species names, such as Human and Mouse. The first letter is capitalised.

species_subgroup – this may designate a strain, breed, or some other species subgroup, for example BALB/c for Mouse. Because of the way that slashes are processed by our web server, any slashes in the species_subgroup must be replaced in the link with the string %252f, so that BALB/c becomes BALB%252fc. If no species_subgroup is applicable for the germline set, this component is left out, meaning that the link will have four components rather than five.

name – the name of the germline set. As above, if this contains slashes, %252f should be used in place of the slash.

revision_number – this is the revision number of the set, if a specific revision is desired, or ‘published’ or ‘latest’ to obtain the currently published version.

format – one of airr, gapped, or ungapped, to retrieve the AIRR-C format or the gapped or ungapped FASTA format respectively. For Human sets, ‘_ex’ should be added (as in gapped_ex) to retrieve the Reference Set. Without the ‘_ex’, the Source Set will be retrieved.
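To show how the components fit together, here is a sketch of a pair of shell functions that assemble an API link from them. The function names encode_component and germline_set_url are hypothetical, and we are assuming that the API link takes the components in the order listed above, with the species_subgroup simply omitted when it is empty.

```shell
#!/bin/sh
# Hypothetical helpers for assembling an OGRDB REST API germline set link.

# Replace each slash in a component with %252f
encode_component() {
    printf '%s' "$1" | sed 's|/|%252f|g'
}

# $1 = species, $2 = species_subgroup (empty string if not applicable),
# $3 = set name, $4 = revision number, 'published' or 'latest', $5 = format
germline_set_url() {
    base="https://ogrdb.airr-community.org/api/germline/set"
    url="$base/$1"
    # The subgroup component is left out entirely when it does not apply
    if [ -n "$2" ]; then
        url="$url/$(encode_component "$2")"
    fi
    printf '%s\n' "$url/$(encode_component "$3")/$4/$5"
}

# Four-component link, as no species_subgroup applies
germline_set_url Human "" IGH_VDJ published gapped_ex
```

The call at the end prints the API link for the Human IGH_VDJ set used in the examples above. For a set with a subgroup, a call such as germline_set_url Mouse "BALB/c" SET_NAME published gapped, where SET_NAME is a placeholder rather than a real set name, would give a five-component link containing BALB%252fc.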