AIRR-Seq data: Allele Names

The Germline Genes page lists all alleles discovered in the data set(s). The Appearances column counts the number of individuals in which each allele is found. You can click on a number in the Appearances column to view a list of samples containing instances of the allele. Note that the number of samples may not match the number of appearances, because there can be more than one sample for some individuals. It’s also possible, where read counts vary substantially between samples, to find that a particular allele has not been identified in all samples of an individual.

‘Simple’ allele names that follow the IMGT convention, for example IGHV1-69*01, identify alleles that were listed in the IMGT reference set used by the pipeline. In some cases, the reference set contains more than one allele with the same sequence. There are several instances of this in the IGHV set: IGHV3-30*02 and IGHV3-30-5*02, for example, have identical sequences. As alleles with identical sequences can’t be distinguished in AIRR-seq repertoires, VDJbase allocates all instances to one name, and lists names with identical sequences in the ‘identical’ column. We refer to these alleles with more than one name as ‘ambiguous’ alleles.
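
The grouping described above can be sketched in a few lines of Python. This is only an illustration, not VDJbase's implementation: the function name `collapse_identical` is hypothetical, the sequences are short placeholders rather than real germline sequences, and the choice of the alphabetically first name as the one that instances are allocated to is an assumption made for the example.

```python
from collections import defaultdict

def collapse_identical(reference):
    """Group allele names that share a sequence.

    Returns {chosen_name: [other names with the same sequence]}.
    Choosing the alphabetically first name is an assumption for this sketch.
    """
    by_sequence = defaultdict(list)
    for name, sequence in reference.items():
        by_sequence[sequence.lower()].append(name)

    collapsed = {}
    for names in by_sequence.values():
        names = sorted(names)
        collapsed[names[0]] = names[1:]
    return collapsed

# Placeholder sequences, NOT real germline sequences.
toy_reference = {
    "IGHV3-30*02": "caggtgcag",
    "IGHV3-30-5*02": "caggtgcag",
    "IGHV1-69*01": "caggtccag",
}

for name, identical in collapse_identical(toy_reference).items():
    print(name, "identical:", identical)
```

In this toy set, the two IGHV3-30 names end up in one entry, with the second name listed as 'identical' to the first.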

The pipeline allows VDJbase to infer ‘previously undocumented’ alleles: alleles that are not listed in the original reference set. These are named after the reference set allele with the closest sequence, followed by a list of the inferred polymorphisms, for example IGHV1-18*01_a190g. A run of adjacent polymorphisms (sometimes seen, for example, at the 3′ or 5′ end) is shortened by specifying the start and end co-ordinates, and the polymorphic sequence: for example TRBV10-3*03_313gccatcagtgagtc326.
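
As a rough illustration of this naming scheme, the sketch below splits an inferred name into its base allele and its polymorphisms. It is not part of VDJbase: the function name `parse_inferred_name` and the tuple layout are hypothetical, and the token shapes are inferred from the two examples above. In particular, reading a190g as "original base, position, new base" is an assumption of this sketch.

```python
import re

# Token shapes assumed from the two examples in the text:
#   a190g                -> single substitution: original base, position, new base
#   313gccatcagtgagtc326 -> run: start position, polymorphic sequence, end position
SNP = re.compile(r"^([acgtn])(\d+)([acgtn])$")
RUN = re.compile(r"^(\d+)([acgtn]+)(\d+)$")

def parse_inferred_name(name):
    """Return (base_allele, polymorphisms, unrecognised_tokens)."""
    base, *tokens = name.split("_")
    polymorphisms, unrecognised = [], []
    for token in tokens:
        snp = SNP.match(token.lower())
        run = RUN.match(token.lower())
        if snp:
            # ("snp", position, original base, new base)
            polymorphisms.append(("snp", int(snp.group(2)), snp.group(1), snp.group(3)))
        elif run:
            # ("run", start, end, polymorphic sequence)
            polymorphisms.append(("run", int(run.group(1)), int(run.group(3)), run.group(2)))
        else:
            # Anything else is kept aside rather than guessed at.
            unrecognised.append(token)
    return base, polymorphisms, unrecognised

print(parse_inferred_name("IGHV1-18*01_a190g"))
print(parse_inferred_name("TRBV10-3*03_313gccatcagtgagtc326"))
```

In the TRBV10-3 example the polymorphic sequence is 14 nucleotides long, which matches co-ordinates 313 to 326 counted inclusively.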

VDJbase includes ‘short read’ datasets, in which part of the V-REGION (at the 5′ end) is omitted. With these, there is scope for greater ambiguity, because nucleotides distinguishing particular alleles, or even alleles of more than one gene, may be omitted. For these datasets, we use custom reference sets, in which the ambiguity is included in the name. IGHV3-74*01_02, for example, refers to a sequence that could be attributed either to IGHV3-74*01 or to IGHV3-74*02, the differentiating nucleotides lying too close to the 5′ end to be found in this short-read dataset. A previously undocumented allele with close identity would have a name modified with its polymorphisms, for example IGHV3-74*01_02_g297a. Ambiguities between genes can be seen in some studies, most notably the Adaptive repertoires in the TRB dataset. TRBV3-1*01_02_2.01_2.02_2.03, for example, represents a sequence that is indistinguishable between alleles *01 and *02 of TRBV3-1, and alleles *01, *02 and *03 of TRBV3-2.
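
The short-read names can be expanded back into the alleles they cover along the lines of the sketch below. The function name `expand_ambiguous` is hypothetical, and the rules are inferred only from the examples in this section: a token made of digits is read as another allele of the same gene, and a token such as 2.01 is read as allele *01 of the gene whose name ends in -2 instead of -1. Both readings are assumptions, and names that do not follow this pattern would need different handling.

```python
import re

ALLELE_ONLY = re.compile(r"^\d+$")               # e.g. "02"
GENE_AND_ALLELE = re.compile(r"^(\w+)\.(\d+)$")  # e.g. "2.01"

def expand_ambiguous(name):
    """Return (covered_alleles, polymorphism_tokens) for a short-read name."""
    gene, _, rest = name.partition("*")
    tokens = rest.split("_")
    # "TRBV3" from "TRBV3-1"; assumes the gene name contains a hyphen.
    stem = gene.rsplit("-", 1)[0]

    alleles, polymorphisms = [], []
    for token in tokens:
        cross_gene = GENE_AND_ALLELE.match(token)
        if polymorphisms:
            # Once a polymorphism is seen, treat everything after it as one too.
            polymorphisms.append(token)
        elif ALLELE_ONLY.match(token):
            alleles.append(f"{gene}*{token}")
        elif cross_gene:
            alleles.append(f"{stem}-{cross_gene.group(1)}*{cross_gene.group(2)}")
        else:
            polymorphisms.append(token)
    return alleles, polymorphisms

print(expand_ambiguous("IGHV3-74*01_02_g297a"))
print(expand_ambiguous("TRBV3-1*01_02_2.01_2.02_2.03"))
```

For the TRBV3-1 example this yields alleles *01 and *02 of TRBV3-1 followed by *01, *02 and *03 of TRBV3-2, in line with the description above.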

There are cases where no allele of a particular gene is found in a repertoire. These represent potential rather than actual deletions: we cannot tell from AIRR-seq whether the gene is present in chromosomal DNA; we simply observe that it is not expressed at levels that are detected in the repertoire. For each gene in the reference set, VDJbase provides a name of the form IGHV1-24*Del, which can be used to identify such occurrences.
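
A minimal sketch of how such names relate to a repertoire is shown below. The function name `potential_deletions` and the input lists are hypothetical examples, not VDJbase data, and the sketch takes the gene to be everything before the `*` in each observed name, so cross-gene ambiguous names from short-read datasets would need expanding first.

```python
def potential_deletions(reference_genes, observed_alleles):
    """List GENE*Del names for reference genes with no allele in the repertoire."""
    # The gene is taken to be the part of the allele name before '*'.
    observed_genes = {allele.split("*", 1)[0] for allele in observed_alleles}
    return [f"{gene}*Del" for gene in reference_genes if gene not in observed_genes]

# Hypothetical inputs for illustration only.
reference_genes = ["IGHV1-18", "IGHV1-24", "IGHV1-69"]
observed_alleles = ["IGHV1-18*01_a190g", "IGHV1-69*01"]

print(potential_deletions(reference_genes, observed_alleles))
```

Here only IGHV1-24 has no observed allele, so it is the only gene reported with a *Del name.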
