Human IGHC set released

We’re pleased to publish the first version of a human IGHC germline set on OGRDB. The set is based on the work of Jana et al., 2025, and contains 105 full-length IGH constant region sequences, derived from long-read genomic sequencing. This work substantially increased the number of published sequences and revealed significant polymorphism and population diversity in constant region genes.

Formats

The downloadable sets in FASTA format contain full-length coding sequences of both the transmembrane and secreted forms. Names of secreted-form sequences carry the suffix _SC. Please note that the nucleotide sequences as published do not start in-frame: during splicing, the constant region picks up a G from the J gene at its 5′ end, and that G completes the first codon.
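
As an illustration of handling the FASTA download, the following minimal Python sketch reads records, separates the two forms by the _SC suffix, and prepends the J-derived G so that a sequence can be read in-frame. The function names are hypothetical and the sequences in the demo are toy strings, not real alleles. It assumes that each header line holds the sequence name as its first whitespace-delimited token.

```python
from typing import Dict, Iterable, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text (any iterable of lines) into {name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first token of the header line.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def split_forms(records: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (transmembrane, secreted); secreted keys have _SC removed."""
    membrane = {n: s for n, s in records.items() if not n.endswith("_SC")}
    secreted = {n[:-3]: s for n, s in records.items() if n.endswith("_SC")}
    return membrane, secreted


def with_j_nucleotide(seq: str, j_nt: str = "G") -> str:
    """Prepend the nucleotide donated by the J gene during splicing."""
    return j_nt + seq


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = [">IGHA1*01_c1068g", "CCTCCAAA", ">IGHA1*01_c1068g_SC", "CCTCC"]
    membrane, secreted = split_forms(read_fasta(toy))
    for name, seq in membrane.items():
        print(name, with_j_nucleotide(seq))
```

With a real download, pass an open file handle to read_fasta in place of the toy list.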

The downloadable JSON file contains full metadata. We have extended the MiAIRR germline schema for this purpose with c_exon coordinates for up to 9 constant region exons and with the field secretory_coding_sequence. The transmembrane coding sequence is stored in coding_seq_imgt. These extensions to the schema will be submitted to the AIRR Community’s Standards Working Group for ratification.
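
A rough sketch of pulling both coding sequences out of the JSON file is shown below. Only coding_seq_imgt and secretory_coding_sequence are taken from the description above. The GermlineSet / allele_descriptions / label layout is an assumption based on the AIRR germline schema, and because the exact names of the exon coordinate fields are not given here, the sketch simply collects any key beginning with c_exon. Check all of these against the actual file before relying on them.

```python
import json
from typing import Any, Dict, Iterator


def iter_allele_descriptions(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield each allele description from a parsed germline set document."""
    # Assumption: top-level "GermlineSet" holds one set or a list of sets.
    sets = doc.get("GermlineSet", [])
    if isinstance(sets, dict):
        sets = [sets]
    for germline_set in sets:
        for allele in germline_set.get("allele_descriptions", []):
            yield allele


def summarise(allele: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields relevant to constant region sequences."""
    return {
        "label": allele.get("label"),
        "membrane_cds": allele.get("coding_seq_imgt"),
        "secreted_cds": allele.get("secretory_coding_sequence"),
        # Assumption: exon coordinate keys share the prefix "c_exon".
        "exon_fields": {
            key: value
            for key, value in allele.items()
            if key.startswith("c_exon") and value is not None
        },
    }


def load_summaries(path: str):
    """Read a JSON file from disk and summarise every allele in it."""
    with open(path, encoding="utf-8") as handle:
        doc = json.load(handle)
    return [summarise(a) for a in iter_allele_descriptions(doc)]
```

Using .get throughout means that a record lacking a secreted form yields None rather than raising an error.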

Naming

The sequences are named after the closest currently-named allele in the IUIS set, with SNPs appended to describe the variation, for example IGHA1*01_c1068g. Where the number of SNPs would make the name impractically long, a four-character code is used in place of the SNPs, for example IGHA1*06_6bc5. The sequences will be submitted to IUIS for specific names, but we elected to publish with interim names because the review and naming process may take some time, in particular because the gene duplication and other structural variation discovered in the study raise some complex questions.
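
The interim names can be taken apart programmatically. The sketch below is one possible parser, written under several assumptions that go beyond the two examples given: that multiple SNPs are joined with underscores, that a SNP token reads as original base, position, new base, and that a _SC suffix may follow the name in FASTA files. A four-character token that happens to look like a SNP (for example a12c) is treated as a SNP. All names in the code are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

SNP_RE = re.compile(r"^([acgt])(\d+)([acgt])$")
NAME_RE = re.compile(r"^(?P<gene>[^*]+)\*(?P<allele>[^_]+)(?:_(?P<rest>.+))?$")


class ParsedName(NamedTuple):
    gene: str
    base_allele: str
    snps: List[Tuple[str, int, str]]  # (original base, position, new base)
    code: Optional[str]               # four-character code, if used
    secreted: bool                    # True when the name ended in _SC


def parse_name(name: str) -> ParsedName:
    """Split an interim allele name into its components."""
    secreted = name.endswith("_SC")
    if secreted:
        name = name[:-3]
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised allele name: {name!r}")
    rest = match.group("rest")
    snps: List[Tuple[str, int, str]] = []
    code = None
    for token in rest.split("_") if rest else []:
        snp = SNP_RE.match(token)
        if snp:
            snps.append((snp.group(1), int(snp.group(2)), snp.group(3)))
        elif len(token) == 4 and code is None:
            code = token
        else:
            raise ValueError(f"unrecognised name component: {token!r}")
    return ParsedName(match.group("gene"), match.group("allele"), snps, code, secreted)


if __name__ == "__main__":
    for example in ("IGHA1*01_c1068g", "IGHA1*06_6bc5"):
        print(parse_name(example))
```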

Usage

A growing number of technologies, such as 10x, PacBio and ONT, allow the constant region to be fully or near-fully characterized in repertoire sequencing. For other, short-read methods, where only a short stretch of the constant region is captured, we recommend building a tailored set that is trimmed to the expected captured length and de-duplicated so that it does not contain multiple identical sequences.
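
One way to build such a tailored set is sketched below. It assumes that the captured portion is the 5′ end of the constant region, so each sequence is truncated to its first N nucleotides, and that alleles which become identical after truncation should be merged under a combined name. The function name and the use of "|" as a name separator are illustrative choices, not an OGRDB convention.

```python
from collections import defaultdict
from typing import Dict, List


def build_tailored_set(sequences: Dict[str, str], length: int) -> Dict[str, str]:
    """Truncate sequences to `length` nt and merge identical results.

    Sequences shorter than `length` are kept whole.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, seq in sequences.items():
        # Assumption: reads cover the 5' end of the constant region.
        groups[seq[:length].upper()].append(name)
    # One entry per distinct truncated sequence, named after all its sources.
    return {"|".join(sorted(names)): prefix for prefix, names in groups.items()}


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = {"A*01": "ACGTAA", "A*02": "ACGTCC", "B*01": "TTGTAA"}
    for name, seq in build_tailored_set(toy, 4).items():
        print(f">{name}\n{seq}")
```

Keeping every contributing allele in the merged name makes it clear, when a read is assigned, which alleles cannot be distinguished at that read length.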