Introduction

The molecular synthesis and genetics of the immunoglobulin (IG) and T cell receptor (TR) chains are particularly complex and unique, as they include biological mechanisms such as DNA molecular rearrangements in multiple loci (three for IG and four for TR in humans) located on different chromosomes (four in humans), nucleotide deletions and insertions at the rearrangement junctions (or N-diversity), and somatic hypermutations in the IG loci (for review, see Refs 1, 2). The number of potential protein forms of IG and TR is almost unlimited. Owing to the complexity and high number of published sequences, data control, classification and detailed annotation are very difficult tasks for generalist databanks such as EMBL, GenBank, and DDBJ.3,4,5 These observations were the starting point of IMGT, the international ImMunoGeneTics database® (http://imgt.cines.fr),6 created in 1989 by LIGM, at the Université Montpellier II, CNRS, Montpellier, France.

IMGT is a high-quality integrated information system specializing in IG, TR and MHC of human and other vertebrates. It consists of several databases (IMGT/LIGM-DB, IMGT/3Dstructure-DB, IMGT/HLA-DB), Web resources (‘IMGT Marie-Paule page’) and interactive tools (IMGT/V-QUEST, IMGT/JunctionAnalysis).6 The expertly annotated IMGT data and tools described in this paper are particularly useful for the analysis of the IG and TR rearrangements. Owing to its high quality and its easy data distribution, IMGT has important implications in medical research (repertoire in leukemias, lymphomas, myelomas, translocations, autoimmune diseases, AIDS), therapeutic approaches, and biotechnology related to antibody engineering. IMGT is freely available at http://imgt.cines.fr.

IMGT databases

The IMGT databases comprise three databases: (1) IMGT/LIGM-DB is a comprehensive database of IG and TR nucleotide sequences from human and other vertebrate species, with translation for fully annotated sequences, created in 1989 by LIGM, Laboratoire d'ImmunoGénétique Moléculaire, Montpellier, France, and on the Web since July 1995.6,7,8,9,10 In April 2002, IMGT/LIGM-DB contained 56 188 nucleotide sequences of IG and TR from 105 species. (2) IMGT/3Dstructure-DB is a database which provides the IMGT gene and allele identification and Colliers de Perles of IG and TR with known 3D structures, created by LIGM, on the Web since November 2001.11 In February 2002, IMGT/3Dstructure-DB contained 648 entries. (3) IMGT/HLA-DB is a database of the human MHC allele sequences, developed by Cancer Research UK, London, and ANRI (Anthony Nolan Research Institute), London, UK, on the Web since December 1998.12 The two specialized databases, IMGT/3Dstructure-DB and IMGT/HLA-DB, have been described elsewhere.11,12

IMGT/LIGM-DB data

IMGT/LIGM-DB sequence data are identified by the EMBL/GenBank/DDBJ accession number. The unique source of data for IMGT/LIGM-DB is EMBL, which shares data with the other two generalist databases, GenBank and DDBJ. Once the authors allow the sequences to be made public, LIGM automatically receives IG and TR sequences by email from EMBL. After control by LIGM curators, data are scanned to store sequences, bibliographical references and taxonomic data, and standardized IMGT/LIGM-DB keywords are assigned to all entries. Based on expert analysis, specific detailed annotations are added to IMGT flat files in a second step.7

Since August 1996, the IMGT/LIGM-DB content has closely followed that of EMBL for the IG and TR, with the following advantages: IMGT/LIGM-DB does not contain sequences which have previously been wrongly assigned to IG and TR; conversely, IMGT/LIGM-DB contains IG and TR entries which have disappeared from the generalist databases (as examples: the L36092 accession number, which encompasses the complete human TRB locus, is still present in IMGT/LIGM-DB, whereas it has been deleted from EMBL/GenBank/DDBJ because of its excessive size (684 973 bp); in 1999, IMGT detected the disappearance of 20 IG and TR sequences which had inadvertently been lost by GenBank, and allowed the recovery of these sequences in the generalist databases).

IMGT/LIGM-DB interface and data distribution

One of the major objectives of IMGT was to provide immunologists with a user-friendly interface. The Web interface allows searches according to immunogenetic specific criteria and is easy to use without any knowledge of a programming language. The interface allows users to connect easily from any type of platform (PC, Macintosh, workstation) using freeware such as Netscape. All IMGT/LIGM-DB information is available through search criteria (Figure 1): catalogue, accession number, mnemonic, definition, length, etc.; taxonomy, nucleic acid type, loci, genes or chains, functionality, structure, specificity, etc.; keywords; annotation labels; references.

Figure 1

IMGT/LIGM-DB search page (http://imgt.cines.fr). Five modules of search are available: Catalogue, Taxonomy and Characteristics, Keywords, Annotation labels and References. These modules allow extensive and complex queries on immunoglobulin and T cell receptor sequences from human and other vertebrates. In April 2002, IMGT/LIGM-DB contained 56 188 sequences of IG and TR from 105 species. A short path selection allows a direct query with an accession number or with a part of it. For example, ‘AF306350’ will retrieve that sequence, whereas ‘AF306’ will retrieve all sequences beginning with AF306.
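
The short path selection described above behaves like a prefix match on accession numbers. The following is a minimal sketch of that behaviour only, not the IMGT implementation: the function name `match_accessions` and the small in-memory catalogue are assumptions made for illustration (the accession numbers themselves are ones quoted in this paper).

```python
def match_accessions(query, accessions):
    """Return the accession numbers that start with the query string."""
    q = query.strip().upper()
    return [acc for acc in accessions if acc.upper().startswith(q)]


# Hypothetical in-memory catalogue, standing in for the database index.
catalogue = ["AF306350", "AF306366", "L36092"]

full_match = match_accessions("AF306350", catalogue)   # complete accession number
prefix_match = match_accessions("AF306", catalogue)    # partial accession number

print(full_match)
print(prefix_match)
```

A complete accession number selects a single entry, whereas a partial one selects every entry sharing that beginning.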

Selection is displayed at the top of the resulting sequence pages, so that users can check their own queries.9 Users have the possibility to modify their request or consult the results.9 They can (1) add new conditions to increase or decrease the number of resulting sequences, (2) view details concerning the selected sequences and choose among nine possibilities: annotations, IMGT flat file, coding regions with protein translation, catalogue and external references, sequence in dump format, sequence in FASTA format, sequence with three reading frames, EMBL flat file, IMGT/V-QUEST, or (3) search for sequence fragments corresponding to a particular label.9

IMGT/LIGM-DB data are also distributed by EBI (distribution of CD-ROM, network fileserver: netserv@ebi.ac.uk, and anonymous FTP server, ftp://ftp.ebi.ac.uk/pub/databases/imgt/), by the CINES anonymous FTP server (ftp://ftp.cines.fr/pub/IMGT/), and from many SRS (Sequence Retrieval System) sites.

IMGT/LIGM-DB can be searched by BLAST or FASTA on different servers (EBI, IGH, INFOBIOGEN, Institut Pasteur, etc.).

IMGT Web resources

IMGT Web resources (‘IMGT Marie-Paule page’)6 comprise the following sections: ‘IMGT Scientific chart’, ‘IMGT Repertoire’, ‘IMGT Bloc-notes’, ‘IMGT Education’, ‘IMGT Aide-mémoire’ and ‘IMGT Index’.

IMGT Scientific chart

The IMGT Scientific chart provides the controlled vocabulary and the annotation rules and concepts defined by IMGT13 for the identification, description, classification and numerotation of the IG and TR data of human and other vertebrates.

Concept of identification: standardized keywords:

IMGT standardized keywords for IG and TR include the following: (1) General keywords: indispensable for the sequence assignments, they are described in an exhaustive and non-redundant list, and are organized in a tree structure. (2) Specific keywords: they are more specifically associated with particularities of the sequences (orphon, transgene, etc.) or to diseases (leukemia, lymphoma, myeloma, etc.).7 The list is not definitive and new specific keywords can easily be added if needed. IMGT/LIGM-DB standardized keywords have been assigned to all entries.

Concept of description: standardized sequence annotation:

One hundred and seventy-seven feature labels are necessary to describe all structural and functional subregions that compose IG and TR sequences,7 whereas only seven of them are available in EMBL, GenBank or DDBJ. Annotation of sequences with these labels constitutes the main part of the expertise. Levels of annotation have been defined, which allow users to query sequences in IMGT/LIGM-DB even though they are not fully annotated.7

Prototypes represent the organizational relationship between labels and give information on the order and expected length (in number of nucleotides) of the labels.7,9

Concept of classification: standardized IG and TR gene nomenclature:

The objective is to provide immunologists and geneticists with a standardized nomenclature per locus and per species which will allow extraction and comparison of data for the complex B and T cell antigen receptor molecules.

The concepts of classification have been used to set up a unique nomenclature of human IG and TR genes, which was approved by HGNC, the HUGO (Human Genome Organization) Nomenclature Committee in 1999.6 The complete list of the human IG and TR gene names1,2,14,15,16,17,18,19,20 has been entered by the IMGT Nomenclature Committee in GDB, Toronto, and LocusLink, NCBI, USA, and is available from the IMGT site.6 IMGT reference sequences have been defined for each allele of each gene based on one or, whenever possible, several of the following criteria: germline sequence, first sequence published, longest sequence, mapped sequence.9,21 They are listed in the germline gene tables of the IMGT Repertoire.22,23,24,25,26,27,28,29 The protein displays show translated sequences of the alleles (*01) of the functional or ORF genes.1,2,30,31

Concept of numerotation: the IMGT unique numbering:

A uniform numbering system for IG and TR sequences of all species has been established to facilitate sequence comparison and cross-referencing between experiments from different laboratories whatever the antigen receptor (IG or TR), the chain type, or the species.32,33

This numbering results from the analysis of more than 5000 IG and TR variable region sequences of vertebrate species from fish to human. It takes into account and combines the definition of the framework (FR) and complementarity determining region (CDR),34 structural data from X-ray diffraction studies,35 and the characterization of the hypervariable loops.36 In the IMGT numbering, conserved amino acids from frameworks always have the same number whatever the IG or TR variable sequence, and from whatever species they come (as examples: Cysteine 23 (in FR1), Tryptophan 41 (in FR2), Leucine 89 and Cysteine 104 (in FR3)). Tables and graphs are available on the IMGT Web site at http://imgt.cines.fr and in Refs 1 and 2.

This IMGT unique numbering has several advantages:

  • It has allowed the redefinition of the limits of the FR and CDR of the IG and TR variable domains. The FR-IMGT and CDR-IMGT lengths become in themselves crucial information which characterize variable regions belonging to a group, a subgroup and/or a gene.

  • Framework amino acids (and codons) located at the same position in different sequences can be compared without requiring sequence alignments. This also holds for amino acids belonging to CDR-IMGT of same length.

  • The unique numbering is used as the output of the IMGT/V-QUEST alignment tool. The aligned sequences are displayed according to the IMGT numbering and with the FR-IMGT and CDR-IMGT delimitations.

  • The unique numbering has allowed a standardization of the description of mutations and the description of IG and TR allele polymorphisms.1,2 These mutations and allelic polymorphisms are described by comparison to the IMGT reference sequences of the alleles (*01).8,9

  • The unique numbering allows the description and comparison of somatic hypermutations of the IG IMGT variable domains.
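
The advantages listed above can be illustrated with a minimal sketch, assuming that each variable domain is already stored as a mapping from IMGT position to amino acid. The helper names and the two toy fragments are hypothetical and are not real IG or TR sequences; only the four conserved positions (Cysteine 23, Tryptophan 41, Leucine 89, Cysteine 104) are taken from the text.

```python
# Conserved framework positions quoted in the text (IMGT unique numbering).
CONSERVED = {23: "C", 41: "W", 89: "L", 104: "C"}


def check_conserved(numbered):
    """Return the conserved positions whose residue differs from the expected one."""
    return {
        pos: numbered.get(pos)
        for pos, expected in CONSERVED.items()
        if numbered.get(pos) != expected
    }


def differences(a, b):
    """Compare two numbered sequences position by position, without alignment."""
    shared = sorted(set(a) & set(b))
    return [(pos, a[pos], b[pos]) for pos in shared if a[pos] != b[pos]]


# Hypothetical toy fragments keyed by IMGT position (illustration only).
seq_a = {22: "S", 23: "C", 24: "K", 41: "W", 89: "L", 104: "C"}
seq_b = {22: "T", 23: "C", 24: "K", 41: "W", 89: "M", 104: "C"}

print(check_conserved(seq_a))
print(check_conserved(seq_b))
print(differences(seq_a, seq_b))
```

Because both sequences share the same position keys, a substitution can be reported directly as a position and a pair of residues, which is the basis of the standardized description of mutations and allelic polymorphisms.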

By facilitating the comparison between sequences and by allowing the description of alleles and mutations, the IMGT unique numbering represents a big step forward in the analysis of the IG and TR sequences of all vertebrate species. Moreover, it gives insight into the structural configuration of the variable domain and opens interesting views on the evolution of these sequences, since this numbering has been applied with success to all the sequences belonging to the V-set of the immunoglobulin superfamily, including non-rearranging sequences in vertebrates (human CD4, Xenopus CTXg1, etc.) and in invertebrates (Drosophila amalgam, Drosophila fasciclin II, etc.).8,9,32,33

IMGT Repertoire

IMGT Repertoire is the global Web resource in ImMunoGeneTics for the immunoglobulins and T cell receptors of human and other vertebrates, based on the ‘IMGT Scientific chart’. IMGT Repertoire provides an easy-to-use interface to carefully and expertly annotated data on the genome, proteome, polymorphism and structural data of the IG and TR.6 Only titles of this large section are quoted here. Genome data include chromosomal localizations, locus representations, locus description, germline gene tables, potential germline repertoires, lists of IG and TR genes and links between IMGT, HUGO, GDB, LocusLink and OMIM, correspondence between nomenclatures.1,2 Proteome and polymorphism data are represented by protein displays, alignments of alleles, tables of alleles, allotypes, particularities in protein designations, IMGT reference directory in FASTA format, correspondence between IG and TR chain and receptor IMGT designations.1,2 Structural data comprise 2D graphical representations designated as Colliers de Perles,1,2,6,8,9 FR-IMGT and CDR-IMGT lengths, and 3D representations of IG and TR variable domains.10,12 This visualization permits rapid correlation between protein sequences and 3D data retrieved from the Protein Data Bank PDB. Other data comprise: (1) phages, (2) probes used for the analysis of IG and TR gene rearrangements and expression, and RFLP (restriction fragment length polymorphism) studies, (3) data related to gene regulation and expression: promoters, primers, cDNAs, reagent monoclonal antibodies, etc., (4) genes and clinical entities: translocations and inversions, humanized antibodies, monoclonal antibodies with clinical indications, (5) taxonomy of vertebrate species present in IMGT/LIGM-DB, (6) immunoglobulin superfamily: gene exon-intron organization, protein displays, Colliers de Perles and 3D representations of V-LIKE and C-LIKE domains.

IMGT Bloc-notes

The IMGT Bloc-notes provides numerous hyperlinks towards the Web servers specializing in immunology, genetics, molecular biology and bioinformatics (associations, collections, companies, databases, immunology themes, journals, molecular biology servers, resources, societies, tools, etc.).37

IMGT Education

IMGT Education is a new section which provides useful biological resources for students. It includes figures and tutorials (in English and/or in French) on the IG and TR variable and constant domain 3D structures, the molecular genetics of immunoglobulins, the regulation of IG gene transcription, B cell differentiation and activation, etc.

IMGT Aide-mémoire and IMGT Index

IMGT Aide-mémoire provides an easy access to information such as genetic code, splicing sites, amino acid structures, restriction enzyme sites, etc.

IMGT Index is a fast way to access data when information has to be retrieved from different parts of the IMGT site. For example, ‘allele’ provides links to the IMGT Scientific chart rules for the allele description, and to the IMGT Repertoire Alignments of alleles and Tables of alleles.

IMGT interactive tools

IMGT/V-QUEST

IMGT/V-QUEST (V-QUEry and STandardization) is an integrated software tool for IG and TR.6 This easy-to-use tool analyses an input IG or TR germline or rearranged variable nucleotide sequence. IMGT/V-QUEST results comprise the identification of the V, D and J genes and alleles and the nucleotide alignment by comparison with sequences from the IMGT reference directory (Figure 2), the delimitations of the FR-IMGT and CDR-IMGT based on the IMGT unique numbering, the protein translation of the input sequence, the identification of the JUNCTION and the two-dimensional Collier de Perles representation of the V-REGION. The set of sequences from the IMGT reference directory used for IMGT/V-QUEST can be downloaded in FASTA format from the IMGT site.
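
A downloaded reference directory set can be read with any FASTA parser. The following is a minimal, generic sketch: the function name `read_fasta` is an assumption, and the two sample records are placeholders that do not reproduce the actual header layout or content of the IMGT files.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Placeholder records (illustration only, not IMGT data).
sample = io.StringIO(">entry_one placeholder\nacgt\nacgt\n>entry_two placeholder\nttga\n")
records = list(read_fasta(sample))
print(records)
```

For a real file, the `io.StringIO` object would simply be replaced by an open text file handle.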

Figure 2

IMGT/V-QUEST (http://imgt.cines.fr) results on gene and allele identification. IMGT/V-QUEST compares the input germline or rearranged IG or TR variable sequences with the IMGT/V-QUEST reference directory sets. For example, the highest scores for the input AF306366 rearranged sequence allow identification of IGHV1-3*01, IGHD3-10*01, IGHJ4*02 as being the genes and alleles involved in the V-D-J rearrangement. The IMGT/V-QUEST results comprise the translation of the JUNCTION for rearranged sequences, and also, not shown in the figure, the delimitations of the FR-IMGT and CDR-IMGT, the protein translation and the two-dimensional representation or Collier de Perles of the V-REGION. Information provided by IMGT/V-QUEST (V and J gene and allele names, sequence of the JUNCTION (from 2nd-CYS 104 to J-PHE or J-TRP 118)) can then be used in IMGT/JunctionAnalysis for a confirmation of the D gene and allele identification and a more accurate analysis of the junction (see Figure 3).
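
Allele names such as those quoted above (IGHV1-3*01, IGHD3-10*01, IGHJ4*02) can be split into their components before being passed to another tool. The following is a minimal sketch under stated assumptions: the regular expression and the function name `parse_allele_name` are illustrative, cover only V, D and J genes of the three IG and four TR loci, and are not an official IMGT parser.

```python
import re

# Assumed layout: locus (IGH, IGK, IGL, TRA, TRB, TRD, TRG), gene type letter
# (V, D or J), rest of the gene designation, then "*" and the allele number.
NAME_RE = re.compile(r"^(IG[HKL]|TR[ABDG])([VDJ])([^*]*)\*(\d+)$")


def parse_allele_name(name):
    """Split an allele name into locus, gene type, gene name and allele number."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised allele name: {name}")
    locus, gene_type, rest, allele = m.groups()
    return {
        "locus": locus,
        "gene_type": gene_type,
        "gene": locus + gene_type + rest,
        "allele": allele,
    }


for name in ("IGHV1-3*01", "IGHD3-10*01", "IGHJ4*02"):
    print(parse_allele_name(name))
```

Constant gene names are deliberately left out of this sketch, and a name that does not fit the assumed layout raises a `ValueError` rather than being guessed at.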

IMGT/JunctionAnalysis

IMGT/JunctionAnalysis is a tool, complementary to IMGT/V-QUEST, which provides a thorough analysis of the V-J and V-D-J junction of IG and TR rearranged genes (Figure 3). IMGT/JunctionAnalysis identifies the D-GENE and allele involved in the IGH, TRB and TRD V-D-J rearrangements by comparison with the IMGT reference directory, and delimits precisely the P, N and D regions (Figure 3).1,2 Results from IMGT/JunctionAnalysis are more accurate than those given by IMGT/V-QUEST regarding the D-GENE identification. Indeed, IMGT/JunctionAnalysis works on shorter sequences (JUNCTION), and with a higher constraint since the identification of the V-GENE and J-GENE and alleles is a prerequisite to perform the analysis. Several hundred junction sequences can be analyzed simultaneously.

Figure 3

IMGT/JunctionAnalysis (http://imgt.cines.fr) results. The IMGT/JunctionAnalysis results comprise, for each junction, the identification of the D-GENE and allele, the identification of the P and N regions (N1, N2, etc.) and their precise delimitations, and the junction translation. The CDR3-IMGT numbering is according to the IMGT unique numbering for V-DOMAIN. Vmut, Dmut and Jmut correspond to the number of nucleotide differences in the input junction sequence by comparison to the germline allele sequences. Ngc is the ratio of the number of g+c nucleotides to the total number of nucleotides in the N regions. IMGT/JunctionAnalysis analyses, in a single search, an unlimited number of junctions provided that the V-GENE and J-GENE allele IMGT names are identified.
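
The two quantities defined in this legend, Ngc and the Vmut/Dmut/Jmut counts, can be sketched as follows. This is an illustration of the definitions only, not the IMGT/JunctionAnalysis code: the function names are assumptions, the nucleotide strings are invented, and the delimitation of the regions is taken as already done.

```python
def gc_ratio(n_regions):
    """Ratio of g+c nucleotides to all nucleotides across the N regions."""
    joined = "".join(n_regions).lower()
    if not joined:
        return None  # no N nucleotides: the ratio is undefined
    return sum(joined.count(base) for base in "gc") / len(joined)


def count_differences(observed, germline):
    """Number of positions at which two equal-length nucleotide strings differ."""
    if len(observed) != len(germline):
        raise ValueError("sequences must have equal length")
    return sum(1 for o, g in zip(observed.lower(), germline.lower()) if o != g)


# Hypothetical N regions (N1, N2, N3) and V-region end (illustration only).
ngc = gc_ratio(["ggc", "at", "ccg"])
vmut = count_differences("tgtgcgaga", "tgtgcaaga")

print(ngc)
print(vmut)
```

The same `count_differences` helper would apply to the D and J parts of the junction once each has been delimited against its germline allele.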

IMGT-ONTOLOGY and IMGT interoperability

IMGT-ONTOLOGY

IMGT distributes high-quality data with an important incremental value added by the IMGT expert annotations, according to the rules described in the IMGT Scientific chart. IMGT has developed a formal specification of the terms to be used in the domain of immunogenetics and bioinformatics to ensure accuracy, consistency and coherence in IMGT. This has been the basis of the IMGT-ONTOLOGY,13 the first ontology in the domain, which allows the management of the immunogenetics knowledge for all vertebrate species. Control of coherence in IMGT combines data integrity control and biological data evaluation.38,39

IMGT interoperability

Since July 1995, IMGT has been available on the Web at http://imgt.cines.fr. IMGT provides biologists with an easy-to-use, friendly interface. From January 1996 to April 2002, the IMGT WWW server at Montpellier was accessed by more than 164 000 sites. IMGT receives more than 100 000 requests a month. Two thirds of the visitors are equally distributed between the European Union and the United States. To facilitate the integration of IMGT data into applications developed by other laboratories, we have built an Application Programming Interface (API) to access the database and its software tools (see ‘IMGT Informatics page (API...)’).38 This API includes a set of URL links to access biological knowledge data (keywords, labels, functionalities, list of gene names, etc.), a set of URL links to access all data related to one given sequence, and a set of JAVA class packages to select and retrieve data from an appropriate IMGT server using an object-oriented approach.

Conclusion

The information provided by IMGT is of great value to clinicians and biological scientists in general.40 Tools for the analysis of genetic and phylogenetic data (IMGT/PhyloGene) and the display of physical maps (IMGT/GeneView, IMGT/LocusView), as well as new specific databases (IMGT/PROTEIN-DB, IMGT/PRIMER-DB), are currently in development and will be integrated into IMGT. IMGT/PROTEIN-DB, a protein database for IG and TR, will contain translations of potentially functional and ORF sequences from IMGT/LIGM-DB, and protein data from Kabat et al.34 and PDB. IMGT/PRIMER-DB is an oligonucleotide primer database for IG, TR, and MHC, developed in collaboration with EUROGENTEC (Belgium). More particularly, IMGT/PRIMER-DB will integrate information on primers used for the analysis of the IG and TR gene repertoire and expression, and in the detection of minimal residual diseases in B and T cell malignancies.41,42,43,44,45,46,47,48,49,50,51 IMGT is designed to allow a common access to all immunogenetics data, and particular attention is given to the establishment of cross-referencing links to other databases pertinent to the users of IMGT.

Note: Citing IMGT

Authors who make use of the information provided by IMGT should cite Ref. 6 as a general reference for the access to and content of IMGT, and quote the IMGT home page URL, http://imgt.cines.fr.