Using the OGRDB mouse germline sets

For many analyses, it is important to annotate AIRR-seq repertoires with an accurate and comprehensive germline set. For example, an estimate of the overall mutation rate of a repertoire may be misleading if frequently expressed germline sequences are missing from the set. Likewise, many methods of clonal assignment require an accurate determination of the germline.

The IG loci of the mouse, for both heavy and light chains, show substantial variation between commonly used laboratory strains. Perhaps the most striking example is seen between the IGHV loci of the BALB/c and C57BL/6 strains, where a recent study identified only four sequences in common. Given this degree of variation, we provide here germline sets, drawn from recent publications, for commonly used strains. These sets were discussed and reviewed by a subgroup of the AIRR Community’s Germline Database Working Group, and will be kept under review as more information becomes available.

Because these sets were derived for the most part from AIRR-seq data, we are unable to map the allele sequences to genes. For example, the evidence to hand does not show whether a sequence in the IGHV BALB/c set is derived from the same gene as a similar sequence in the C57BL/6 set. We hope that, over time, genomic information will allow such determinations to be made. For the time being, we adopt a naming convention that does not include an allele or subgroup (gene family) designation: please see our preprint and this poster for details. For compatibility with existing pipelines, the FASTA germline sets downloadable from OGRDB include a subgroup number of 0 and an allele number of 00, creating names of the form IGKV0-32MJ*00. These ‘dummy’ subgroup and allele numbers are not formally part of the name, but they minimise issues with current software tools. We have published tools for manipulating these dummy numbers. The tools remain available, but we hope they will no longer be needed now that the downloadable germline sets include the dummy numbers.
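As an illustration of the naming convention only, the following Python sketch converts between a name carrying the dummy numbers (such as IGKV0-32MJ*00) and the same name without them. It is not the published tooling. The function names, the regular expressions, and the assumption that the label after the hyphen consists of upper-case letters and digits are all hypothetical choices made for this example.

```python
import re

# Assumed shape of a downloadable name: locus and segment (e.g. IGKV),
# dummy subgroup "0", a hyphen, an alphanumeric label, dummy allele "*00".
WITH_DUMMY = re.compile(r"^(IG[HKL][VDJ])0-([A-Z0-9]+)\*00$")
# Assumed shape of the same name without the dummy numbers.
WITHOUT_DUMMY = re.compile(r"^(IG[HKL][VDJ])-([A-Z0-9]+)$")


def strip_dummy_numbers(name: str) -> str:
    """Remove the dummy subgroup and allele; return other names unchanged."""
    match = WITH_DUMMY.match(name)
    if match is None:
        return name
    return f"{match.group(1)}-{match.group(2)}"


def add_dummy_numbers(name: str) -> str:
    """Add subgroup 0 and allele 00; return other names unchanged."""
    match = WITHOUT_DUMMY.match(name)
    if match is None:
        return name
    return f"{match.group(1)}0-{match.group(2)}*00"


if __name__ == "__main__":
    print(strip_dummy_numbers("IGKV0-32MJ*00"))
    print(add_dummy_numbers("IGKV-32MJ"))
```

Names that do not match the assumed pattern, such as conventionally named alleles with a real subgroup and allele number, pass through untouched, so the functions can be applied to a mixed list of names.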

In some cases the specific strain used in an experiment may not be known, or may not match any strain for which a reference set is provided. In that case, we advise annotating the repertoire separately with the BALB/c and C57BL/6 sets and using whichever set gives the lower overall mutation rate.
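The comparison could be sketched as follows. This is a minimal example under stated assumptions, not a prescribed method: it assumes that each annotation run yields a per-sequence fractional V identity between 0 and 1 (for instance, a v_identity column in the annotation output), and it takes the overall mutation rate to be one minus the mean identity. The function names and the example values are hypothetical.

```python
from statistics import mean


def mutation_rate(v_identities):
    """Overall mutation rate, taken here as 1 minus the mean fractional V identity."""
    return 1.0 - mean(v_identities)


def pick_germline_set(identities_by_set):
    """Return (best set name, rate per set), choosing the lowest mutation rate.

    identities_by_set maps a germline set name to the list of per-sequence
    fractional V identities obtained when annotating with that set.
    """
    rates = {
        name: mutation_rate(values)
        for name, values in identities_by_set.items()
        if values  # skip sets with no annotated sequences
    }
    if not rates:
        raise ValueError("no annotated sequences were supplied")
    best = min(rates, key=rates.get)
    return best, rates


if __name__ == "__main__":
    # Made-up identities for illustration only.
    example = {
        "BALB/c": [0.90, 0.92, 0.91],
        "C57BL/6": [0.98, 0.99, 0.97],
    }
    best, rates = pick_germline_set(example)
    print(best, rates)
```

Other definitions of the overall mutation rate, such as one weighted by alignment length, would fit the same structure by replacing mutation_rate.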

Feedback on these sets, the overall approach, and tools provided is welcome and can be sent to william@lees.org.uk.