Human IGHC set released

We’re pleased to publish the first version of a human IGHC germline set on OGRDB. The set is based on the work of Jana et al., 2025, and contains 105 full-length IGH constant region sequences, derived from long-read genomic sequencing. This work substantially increased the number of published sequences and revealed significant polymorphism and population diversity in constant region genes. Cell Genomics have also published a helpful commentary.

Formats

The downloadable sets in FASTA format contain full-length coding sequences of both the transmembrane and secreted forms. Names of the secreted forms carry an _SC suffix. Please note that the nucleotide sequences are not in-frame on their own, because, during splicing, the constant region picks up a G from the J-gene at its 5′ end.
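As a minimal sketch of how the FASTA download might be handled, the Python below separates secreted from transmembrane records using the _SC suffix and prepends the G contributed by the J-gene so that the sequence can be read in frame. The function names and the toy records (including the gene name IGHX) are illustrative assumptions, not part of any OGRDB tooling.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def split_forms(records):
    """Separate secreted records (_SC suffix) from transmembrane records."""
    secreted = {n: s for n, s in records.items() if n.endswith("_SC")}
    transmembrane = {n: s for n, s in records.items() if not n.endswith("_SC")}
    return transmembrane, secreted


def in_frame(constant_seq):
    """Prepend the G supplied by the J-gene so that codons start at index 0."""
    return "G" + constant_seq


# Toy records for illustration only; these are not real IGHC sequences.
example = """>IGHX*01
CATCCCCGACC
>IGHX*01_SC
CATCCCCG
"""
tm, sc = split_forms(read_fasta(example))
```

Translating in_frame(seq) rather than seq itself keeps the codon boundaries aligned with the spliced transcript.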

The downloadable JSON file contains full metadata. We have added extensions to the MiAIRR germline schema for this purpose: specifically, we have added c_exon coordinates for up to 9 constant region exons, and the field secretory_coding_sequence. The transmembrane coding sequence is stored in coding_seq_imgt. These extensions to the schema will be submitted to the AIRR Community’s Standards Working Group for ratification.
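The sketch below shows one way the two coding-sequence fields might be read from a single allele record. Only the field names coding_seq_imgt and secretory_coding_sequence come from the description above; how records are nested inside the file is not assumed here, and the sample record is hypothetical and heavily trimmed.

```python
import json


def coding_forms(allele_record):
    """Return both coding sequences from one allele record (None if absent)."""
    return {
        "transmembrane": allele_record.get("coding_seq_imgt"),
        "secreted": allele_record.get("secretory_coding_sequence"),
    }


# Hypothetical record for illustration; the sequences are not real.
record = json.loads(
    '{"coding_seq_imgt": "CATCCCCGACC", "secretory_coding_sequence": "CATCCCCG"}'
)
forms = coding_forms(record)
```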

Naming

The sequences are named after the closest currently named allele in the IUIS set, with SNPs added to describe the variation, for example IGHA1*01_c1068g. Where the number of SNPs would make the name unfeasibly long, a four-character code is used in place of the SNPs, for example IGHA1*06_6bc5. While the sequences will be submitted to IUIS for specific names, we elected to publish with interim names because the review and naming process may take some time, in particular because the gene duplication and other structural variation discovered in the study raise some complex questions.
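For scripts that need to take interim names apart, a parser along the following lines may be useful. It is a sketch under several assumptions: that multiple SNPs are separated by underscores, that each SNP is written as base, position, base in lower case, that any other suffix is the four-character code, and that the _SC suffix of secreted forms comes last. A code that happens to look like a SNP would be misread, so treat the result as a convenience rather than a definitive interpretation.

```python
import re

# Assumed layout: GENE*ALLELE followed by optional "_"-separated suffixes.
NAME_RE = re.compile(r"^(?P<gene>[^*]+)\*(?P<allele>[^_]+)(?:_(?P<rest>.+))?$")
# Assumed SNP notation: base, 1-based position, base (e.g. c1068g).
SNP_RE = re.compile(r"^([acgt])(\d+)([acgt])$")


def parse_interim_name(name):
    """Split an interim allele name into gene, base allele, SNPs and code."""
    secreted = name.endswith("_SC")
    if secreted:
        name = name[: -len("_SC")]
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised allele name: {name!r}")
    rest = match.group("rest")
    snps, code = [], None
    for part in rest.split("_") if rest else []:
        snp = SNP_RE.match(part)
        if snp:
            snps.append((snp.group(1), int(snp.group(2)), snp.group(3)))
        else:
            code = part  # anything that is not a SNP is treated as the code
    return {
        "gene": match.group("gene"),
        "allele": match.group("allele"),
        "snps": snps,
        "code": code,
        "secreted": secreted,
    }
```

With the two examples above, parse_interim_name("IGHA1*01_c1068g") yields one SNP at position 1068 and no code, while parse_interim_name("IGHA1*06_6bc5") yields no SNPs and the code 6bc5.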

Usage

A growing number of methods, such as 10x, PacBio and ONT technologies, allow the constant region to be fully or near-fully characterized in repertoire sequencing. For other short-read methods, where only a short stretch of the constant region is captured, we recommend building a tailored set that is matched to the expected captured length and disambiguated so that it does not contain multiple identical sequences.
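One possible way to build such a tailored set is sketched below: each sequence is truncated to the expected captured length, and sequences that become identical are merged into a single entry. The sketch assumes the captured stretch starts at the 5′ end of the constant region, and joining the merged names with "|" is a hypothetical convention chosen for illustration; the names and sequences in the example are toy values.

```python
from collections import defaultdict


def tailored_set(records, length):
    """Truncate each sequence to `length` nt and merge identical results.

    `records` maps allele name to sequence. Sequences shorter than
    `length` are kept whole.
    """
    groups = defaultdict(list)
    for name, seq in records.items():
        groups[seq[:length].upper()].append(name)
    # One entry per distinct truncated sequence, named after all its sources.
    return {"|".join(sorted(names)): seq for seq, names in groups.items()}


# Toy input for illustration only; these are not real IGHC sequences.
toy = {"A*01": "ACGTAA", "A*02": "ACGTCC", "B*01": "TTGTAA"}
short_set = tailored_set(toy, 4)
```

Here A*01 and A*02 cannot be told apart over the first four nucleotides, so they collapse into one entry, which makes the ambiguity explicit in downstream annotation.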