Using the VDJbase REST API – VDJbase and OGRDB

The REST API allows you to access VDJbase programmatically, which is useful if you want to incorporate its data into your own systems and keep it up to date. All communication between the VDJbase client in the browser and the VDJbase server goes through the API: any data you see in the browser has been provided over the API.

The API is rooted at https://vdjbase.org/admin/api/. If you click on that link, you will find a Swagger UI through which you can browse and test API features. You can also explore the API by monitoring requests from your browser to VDJbase as you navigate the site, using browser diagnostics, Postman, or a similar tool.

The API is divided into four areas. Two of these – system and reports – are intended for use within the VDJbase client and won’t be described here, although you can monitor their use as described above if you are interested. The other two – repseq and genomic – can be used to request core data from the server. We will look at some simple applications in the following sections. Examples are in Python, using the requests module.

Requesting AIRR-seq data

The data is organised by species and dataset. A dataset usually corresponds to a locus, e.g. IGH. You can use the API to find the available species, and the datasets available for a species of interest:

Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> api = 'https://vdjbase.org/admin/api/'
>>> r = requests.get(api + 'repseq/species')
>>> r.json()
['Crab-eating Macaque', 'Human', 'Rhesus Macaque']

>>> r = requests.get(api + 'repseq/ref_seqs/Human')
>>> r.json()
[{'dataset': 'IGH', 'description': 'Analysis of Human IGH datsets, compiled 24th November 2021'}, {'dataset': 'IGK', 'description': 'Analysis of Human IGK datsets, compiled 1st August 2021'}, {'dataset': 'IGL', 'description': 'Analysis of Human IGL datsets, compiled 1st August 2021'}, {'dataset': 'TRB', 'description': 'Analysis of Human TRB datsets, compiled 7th July 2021'}]

By design, /species returns the list of any species that have either genomic or AIRR-seq data. It’s possible, therefore, that the /ref_seqs query (which lists datasets) may return an empty list for some species.
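The two queries above can be combined into a small helper that tolerates the empty-list case. This is a minimal sketch rather than part of VDJbase: the function name datasets_by_species and its get argument are hypothetical, and get is assumed to be requests.get or any callable that behaves like it.

```python
def datasets_by_species(api, get):
    """Map each species name to the list of its AIRR-seq dataset names.

    api -- API root ending in '/', e.g. 'https://vdjbase.org/admin/api/'
    get -- a callable such as requests.get (passed in so it can be swapped out)
    """
    response = get(api + 'repseq/species')
    response.raise_for_status()

    result = {}
    for species in response.json():
        response = get(api + 'repseq/ref_seqs/' + species)
        response.raise_for_status()
        # An empty list is a valid answer: the species may only have genomic data.
        result[species] = [entry['dataset'] for entry in response.json()]
    return result
```

With the requests module imported, datasets_by_species(api, requests.get) returns a dictionary keyed by species name; a species with no AIRR-seq datasets maps to an empty list.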

Queries of the datasets themselves can use paging and filtering parameters (see the following section on these for more information). We will request information on samples, using paging to restrict this to the first 5 entries.

>>> r = requests.get(api + 'repseq/samples/Human/IGH', params={'page_number': 1, 'page_size': 5})
>>> result = r.json()
>>> result.keys()
dict_keys(['samples', 'uniques', 'total_items', 'page_size', 'pages'])

result['samples'] is a list containing the sample information itself. result['uniques'] contains information used by the filters in the VDJbase client UI and is not of interest here. The other fields summarise paging information.

>>> len(result['samples'][0].keys())
143
>>> result['samples'][0]['sample_name']
'P1_I100_S1'
>>> result['samples'][0]['reads']
20140
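To retrieve every record rather than just the first page, the paging fields can drive a simple loop. The sketch below assumes that 'pages' holds the total number of pages for the requested page_size; iter_records and fetch_page are hypothetical names introduced for illustration.

```python
def iter_records(fetch_page, page_size=50):
    """Yield every record from a paged endpoint, one page at a time.

    fetch_page(page_number, page_size) must return the decoded JSON
    response, i.e. a dict with at least 'samples' and 'pages'.
    """
    page_number = 1  # the first page is 1, not 0
    while True:
        result = fetch_page(page_number, page_size)
        yield from result['samples']
        # Assumption: 'pages' is the total page count. Also stop on an
        # empty page so an unexpected response cannot loop forever.
        if page_number >= result['pages'] or not result['samples']:
            break
        page_number += 1


# Possible usage with requests (not executed here):
#
#   def fetch_page(page_number, page_size):
#       r = requests.get(api + 'repseq/samples/Human/IGH',
#                        params={'page_number': page_number,
#                                'page_size': page_size})
#       r.raise_for_status()
#       return r.json()
#
#   names = [s['sample_name'] for s in iter_records(fetch_page)]
```

Passing the fetch function in as an argument keeps the paging logic independent of the endpoint, so the same loop can serve sample, subject and sequence queries.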

Finally in this example, we can request the genotype of a selected sample. The genotype is returned in AIRR Data Standard format:

>>> r = requests.get(api + 'repseq/genotype/Human/P1_I100_S1')
>>> result = r.json()
>>> result.keys()
dict_keys(['GenotypeSet'])
>>> result['GenotypeSet']['genotype_class_list'][0].keys()
dict_keys(['receptor_genotype_id', 'locus', 'documented_alleles', 'undocumented_alleles', 'deleted_genes', 'inference_process', 'genotyping_tool', 'genotyping_tool_version'])
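A quick way to inspect a genotype response is to count the entries in each allele list. The sketch below uses only the keys shown above and assumes that documented_alleles, undocumented_alleles and deleted_genes are each a list or null; summarise_genotypes is a hypothetical helper name.

```python
def summarise_genotypes(result):
    """Return one summary row per genotype in a genotype response."""
    rows = []
    for genotype in result['GenotypeSet']['genotype_class_list']:
        rows.append({
            'locus': genotype['locus'],
            # 'or []' covers a field that is missing or null (assumption).
            'documented': len(genotype.get('documented_alleles') or []),
            'undocumented': len(genotype.get('undocumented_alleles') or []),
            'deleted': len(genotype.get('deleted_genes') or []),
        })
    return rows
```

Calling summarise_genotypes(r.json()) on the response above gives a list of small dictionaries, one per genotype, which is easier to scan than the full allele lists.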

Requesting Genomic Data

The overall approach is the same as outlined for AIRR-seq data above.
Requesting species and available loci:

>>> r = requests.get(api + 'genomic/species')
>>> r.json()
['Crab-eating Macaque', 'Rhesus Macaque', 'Human']
>>> r = requests.get(api + 'genomic/data_sets/Rhesus Macaque')
>>> r.json()
[{'dataset': 'IGH', 'locus': 'IGH'}]
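Species names such as 'Rhesus Macaque' contain a space. The requests module normally percent-encodes this for you, but if you assemble URLs for another client or for logging, it can help to build them explicitly. The following is a minimal sketch using only the standard library; build_url is a hypothetical helper, not part of the VDJbase API.

```python
from urllib.parse import quote


def build_url(api, *parts):
    """Join an API root and path segments into a single encoded URL.

    Each segment is percent-encoded on its own, and stray slashes are
    removed so that the result never contains '//' after the root.
    """
    # safe='' makes quote() escape '/' inside a segment as well.
    path = '/'.join(quote(str(part).strip('/'), safe='') for part in parts)
    return api.rstrip('/') + '/' + path
```

For example, build_url('https://vdjbase.org/admin/api/', 'genomic', 'data_sets', 'Rhesus Macaque') returns 'https://vdjbase.org/admin/api/genomic/data_sets/Rhesus%20Macaque'.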

For genomic data we assume one sample per subject, hence the next request is for subjects rather than samples:

>>> r = requests.get(api + 'genomic/subjects/Rhesus Macaque/IGH')
>>> request = r.json()
>>> request.keys()
dict_keys(['samples', 'uniques', 'total_items', 'page_size', 'pages'])

>>> request['samples'][0].keys()
dict_keys(['id', 'identifier', 'name_in_study', 'mother_in_study', 'father_in_study', 'age', 'sex', 'annotation_path', 'annotation_method', 'annotation_format', 'annotation_reference', 'self_ethnicity', 'grouped_ethnicity', 'population', 'population_abbr', 'super_population', 'locus_coverage', 'sequencing_platform', 'assembly_method', 'DNA_source', 'study_name', 'study_title', 'study_id', 'study_date', 'study_description', 'institute', 'researcher', 'reference', 'contact', 'dataset'])

>>> request['samples'][0]['identifier']
'P24_I1'

Requesting the genotype of a subject:

>>> r = requests.get(api + 'genomic/genotype/Rhesus Macaque/P24_I1')
>>> result = r.json()
>>> result['GenotypeSet']['genotype_class_list'][0].keys()
dict_keys(['receptor_genotype_id', 'locus', 'documented_alleles', 'undocumented_alleles', 'deleted_genes', 'inference_process', 'genotyping_tool'])

Paging and Filtering Fields

Requests for subject and sample data can use these optional fields to restrict the amount of information returned.

page_size – the maximum number of records to return in this request.

page_number – the number of the page to return. The first page is 1. The result from any request with non-zero page_size will include the total number of pages.

sort_by – JSON string specifying a list of columns by which results should be sorted (the subject_name or sample_name is used by default). The syntax is [{"field": "<column_name>", "order": "<order>"}] where <order> is either "asc" or "desc".

cols – JSON string specifying a list of the column names for which data should be returned (by default all columns are returned). You can make a request for all columns to determine the column names.

filter – JSON string specifying a list of filters. Only rows that satisfy all the filters in the list will be returned. Each filter element has the following fields:
field – name of the column it should be applied to
op – the comparison operator
value – the value to apply

Example: '{"field": "appears", "op": ">", "value": "1"}'

Operators:
'is_null','is_not_null', '==','eq', '!=','ne', '>','gt', '<','lt',
'>=','ge', '<=','le', 'like', 'ilike','not_ilike', 'in','not_in'
For more information, see sqlalchemy_filters.

Example using all of the paging and filtering fields:

>>> r = requests.get(api + 'repseq/sequences/Human/IGH', params={
...     'page_number': 1,
...     'page_size': 5,
...     'sort_by': '[{"field": "appears", "order": "asc"}]',
...     'cols': '["name", "appears"]',
...     'filter': '[{"field": "appears", "op": ">", "value": "1"}]',
... })
>>> r.json()['samples']
[{'name': 'IGHV3-30-2*01', 'appears': 2, 'dataset': 'IGH', 'igsnper_plot_path': ''}, {'name': 'IGHV3-30-52*01', 'appears': 2, 'dataset': 'IGH', 'igsnper_plot_path': ''}, {'name': 'IGHV3-7*02', 'appears': 2, 'dataset': 'IGH', 'igsnper_plot_path': ''}, {'name': 'IGHV3-74*02', 'appears': 2, 'dataset': 'IGH', 'igsnper_plot_path': ''}, {'name': 'IGHV4-28*07', 'appears': 2, 'dataset': 'IGH', 'igsnper_plot_path': ''}]
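Hand-written JSON strings are easy to get wrong (a missing quote or bracket is enough to break a request), so it may be safer to generate them with json.dumps. The helper below is a sketch under that assumption; build_query and its argument shapes are hypothetical conveniences, not part of the API, and only the parameter names it emits come from the field list above.

```python
import json

VALID_ORDERS = ('asc', 'desc')


def build_query(page_number=None, page_size=None,
                sort_by=None, cols=None, filters=None):
    """Build the params dict for a paged VDJbase request.

    sort_by -- list of (column, order) pairs, order being 'asc' or 'desc'
    cols    -- list of column names
    filters -- list of (column, operator, value) triples
    """
    params = {}
    if page_number is not None:
        params['page_number'] = page_number
    if page_size is not None:
        params['page_size'] = page_size
    if sort_by:
        for _, order in sort_by:
            if order not in VALID_ORDERS:
                raise ValueError(f"order must be 'asc' or 'desc', got {order!r}")
        params['sort_by'] = json.dumps(
            [{'field': field, 'order': order} for field, order in sort_by])
    if cols:
        params['cols'] = json.dumps(list(cols))
    if filters:
        params['filter'] = json.dumps(
            [{'field': field, 'op': op, 'value': value}
             for field, op, value in filters])
    return params
```

The resulting dictionary can be passed directly as the params argument of requests.get, for example build_query(1, 5, cols=['name', 'appears'], filters=[('appears', '>', '1')]).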