ASHG01 Nomenclature Workshop (ASHG01-NW)


Contents

Agenda
Summary
- Participants
Draft Guidelines for Human Gene Nomenclature - 2001
- Introduction
- Summary

This is an archived page. The information displayed has not been updated.

Friday October 12, 2001
San Diego
The Convention Center - Room 14A
14:00-18:00

Agenda

14:00 - 14:30: Introduction of Proposed Guidelines
14:30 - 15:30: Discussion of Guidelines
15:30 - 16:00: Refreshments
16:00 - 17:30: Further Discussion of Guidelines
17:30 - 18:00: Agreement of Proposed Guidelines and alterations

Summary

Summary of "The Proposal of New Human Gene Nomenclature Guidelines"

Participants

Cindy Smith (MGD), Hester Wain (HGNC), Sue Povey (HGNC), Elspeth Bruford (HGNC), Cara Hunsberger (Molecular Therapy and Genomics), Donna Maglott (NCBI/LocusLink), Stephen Scherer (HGAC), Panos Deloukas (Sanger), Gerard Manning (Kinases, Sugen), Chris Porter (GDB), Alan Scott (OMIM), Kim Pruitt (Ref_Seq), Lucy Osborne (WBS Region), Lap-Chee Tsui (HUGO).

Updates from this meeting have now been incorporated into the guidelines.

A list of the issues discussed and actions recommended is detailed below:

Definition of a gene: Does this include objects which are never transcribed? Yes, pseudogenes.
Would we change any gene symbols retrospectively to conform with the new guidelines? No.
Definition of a locus: There was much discussion about this, and no agreement was reached, so for the meantime we will maintain the MGD definition.
Antisense and intronic transcripts are given their own symbol in the mouse, not one that reflects the other gene. Antisense genes in human will only be assigned an AS suffix if there is proven regulatory function but no other known structure or function on which to base a gene name. Genes of unknown function on the opposite strand which have no proven regulatory function should be assigned the suffix OS for "opposite strand".
Anonymous gene families: MGD are proposing an alternative to the FAM scheme (using GG for gene grouping). We are looking into this.
Aliases: All known aliases should be added to the gene entry including those used for proteins. The accession ID for each alias is not currently stored in Genew.
Pseudogene symbol: An alternative of PS instead of P was suggested to make pseudogenes more easily identifiable in searches. However, it was felt this would be no better than P and that information is required from another field i.e. name or function, to ascertain whether the object was a pseudogene. We may be able to introduce a function field to the Genew database in the future. The change in numbering system was discussed and it was agreed to maintain already used systems but to encourage authors when submitting new families of pseudogenes to identify each with its own number and then the suffix P.
The reason for not approving unique symbols such as RING# and BING# was also discussed. These are not approved because the symbols have seriously misled a number of scientists into believing that these genes are related when they are not. C#orf# symbols were suggested as a reasonable temporary solution to this type of problem.
Orthologous genes: Genes which are initially identified in another organism e.g. mouse, will usually be assigned the same symbol in human. However, we do add the original organism to the gene name. It was decided that we would update all previously assigned gene names that indicated organism orthology in a consistent manner to improve database searching. Thus, all gene names containing organism specifications will now indicate the organism in parentheses at the end of the gene name e.g. DEF6 for "differentially expressed in FDCP (mouse homolog) 6" will become: "differentially expressed in FDCP 6 (mouse)".
Ribosomal RNA gene symbols: We are looking into guidelines for these symbols.
LocusXRef: This is the private database that HGNC edits which exports directly to LocusLink. At present, to maintain confidentiality, we sometimes have to create a new entry even if the gene sequence already has an entry. We are investigating other alternatives to generating two gene records that then subsequently need to be merged.
Ref_Seq: There was a query regarding the identification of the sequence used by Ref_Seq. HGNC maintains its own archived copy of the submitted sequence with each gene symbol.
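
The orthology renaming agreed above (moving the organism to parentheses at the end of the gene name) can be sketched in Python. This is a minimal illustration only: the function name `move_organism_to_end` and the regular expression are assumptions made for this example, not part of any HGNC tooling, and the sketch handles only the "(organism homolog)" pattern shown in the DEF6 example.

```python
import re

# Hypothetical pattern: an inline "(<organism> homolog)" marker inside a gene name.
_HOMOLOG = re.compile(r"\s*\((\w+) homolog\)")


def move_organism_to_end(name: str) -> str:
    """Move an inline '(organism homolog)' marker to the end as '(organism)'."""
    match = _HOMOLOG.search(name)
    if match is None:
        return name  # no organism marker found, leave the name unchanged
    organism = match.group(1)
    stripped = _HOMOLOG.sub("", name, count=1).strip()
    return f"{stripped} ({organism})"


print(move_organism_to_end("differentially expressed in FDCP (mouse homolog) 6"))
```

Names without such a marker are returned unchanged, so the helper could be applied to a whole list of gene names without affecting the others.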

Draft Guidelines for Human Gene Nomenclature - 2001

The Draft Guidelines were updated on 17 and 21 September and 3 October 2001, updates shown in red; recent updates (1 November) from the workshop and other sources are shown in pink.
Please note that not all browsers support the fonts used in this document; this is particularly pertinent when viewing the Greek-Latin letter conversion table.

Introduction

Guidelines for human gene nomenclature were first published in 1979 by Shows et al. when the Human Gene Nomenclature Committee was first given the authority to approve and implement human gene names and symbols. Updates of these guidelines were published in 1987 (Shows et al.), 1995 (McAlpine P) and 1997 (White et al.). With the recent publications of the complete human genome sequence there is an estimated total of 26,000-40,000 genes, as suggested by the International Human Genome Sequencing Consortium (2001) and Venter et al. (2001). Thus, the guidelines have been updated to accommodate their application to this wealth of information although symbols are still only assigned when required for communication about genes.

The philosophy of the HGNC remains "that gene nomenclature should evolve with new technology rather than be restrictive as sometimes occurs when historical and single gene nomenclature systems are applied" (Shows et al 1987).

Summary

1. Each approved gene symbol must be unique.
2. Symbols are short-form representations (or abbreviations) of the descriptive gene name.
3. Symbols should only contain Latin letters and Arabic numerals.
4. Symbols should not contain punctuation.
5. Symbols should not end in "G" for gene.
6. Symbols do not contain any reference to species e.g. "h" for human.

1. Criteria for symbol assignment

Gene
A gene is defined as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology". The overwhelming majority of objects named by HGNC are in this category.

Locus
The word locus is not a synonym for gene and refers to a map position. A more precise definition is given in the Rules and Guidelines from the International Committee on Standardized Genetic Nomenclature for Mice which states: "A locus is a point in the genome, identified by a marker, which can be mapped by some means. It does not necessarily correspond to a gene; it could, for example, be an anonymous non-coding DNA segment or a cytogenetic feature. A single gene may have several loci within it (each defined by different markers) and these markers may be separated in genetic or physical mapping experiments. In such cases, it is useful to define these different loci, but normally the gene name should be used to designate the gene itself, as this usually will convey the most information."

Chromosome Region
In the context of gene nomenclature this is defined as a genomic region which has been associated with a particular syndrome or phenotype, particularly when there is a possibility that several genes within it may be involved in the phenotype. Designation of such regions may be requested by the scientific community and approved by HGNC e.g. ANCR "Angelman syndrome chromosome region" and CECR "cat eye syndrome chromosome region".

Symbols may therefore be assigned to the following:

Clearly defined phenotypes shown to be inherited predominantly as monogenic Mendelian traits e.g. BBS1 "Bardet-Biedl syndrome 1".
Unidentified genes contributing to a complex trait shown by linkage or association with a known marker e.g. IDDM6 "insulin-dependent diabetes mellitus 6".
Cloned segments of DNA with sufficient structural, functional, and expression data to identify them as transcribed entities e.g. COX8 "cytochrome c oxidase subunit VIII".
Non-functional copies of genes (pseudogenes) e.g. IL9RP1 "interleukin 9 receptor pseudogene 1".
Genes encoded by the opposite (antisense) strand that overlap a known gene e.g. IGF2AS "insulin-like growth factor 2, antisense".
Transcribed but untranslated functional DNA segments e.g. XIST "X (inactive)-specific transcript".
Cellular phenotypes from which the existence of a gene or genes can be inferred e.g. LOH18CR1 "loss of heterozygosity, 18, chromosomal region 1".
EST clusters which suggest a putative gene e.g. C1orf1 "chromosome 1 open reading frame 1".
Fragments of expressed sequence will be designated a D-number by GDB (the Genome Database) e.g. DXYS155E (Appendix 1)
Polycistronic genes generated from a single mRNA, but with independent coding sequence, physically separable and non-overlapping with other coding sequence giving independent gene products e.g. SNURF "SNRPN upstream reading frame" and SNRPN "small nuclear ribonucleoprotein polypeptide N".
Genes of unknown function which share highly similar sequences e.g. FAM7A1 "family with sequence similarity 7, member A1" etc.
Predicted genes (in silico) which show a high degree of sequence homology to well characterised genes will be assigned the same symbol with an "L" for like e.g. TCP10L "t-complex 10 (a murine tcp homolog)-like".
Intronic transcripts (on the same DNA strand) will be assigned separate symbols, usually relating to the gene in which they reside e.g. COPG2IT1 "coatomer protein complex, subunit gamma 2, intronic transcript 1".

Gene symbols will not usually be assigned to alternative transcripts or to genes predicted solely from in silico data (with no other supporting evidence e.g. significant homology to a characterised gene).

2. Gene symbols

Human gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numerals, with the exception of the C#orf# symbols. Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene (White et al. 1998). Symbols should be inoffensive and should not spell words or match abbreviations that would cause problems with database searching e.g. DNA. Ideally, symbols should be no longer than six characters in length. New symbols must not duplicate existing approved gene symbols in either the human (Genew: Human Gene Nomenclature Database) or the mouse (MGD) databases.

The initial character of the symbol should always be a letter. Subsequent characters may be other letters, or if necessary, Arabic numerals.
All characters of the symbol should be written on the same line; no superscripts or subscripts may be used.
No Roman numerals may be used. Roman numbers in previously used symbols should be changed to their Arabic equivalents.
Greek letters are not used in gene symbols. All Greek letters should be changed to letters in the Latin alphabet (App 2: Table 1). Note that such gene symbols will then appear in lists in the order of the Latin alphabet.
A Greek letter prefixing a gene name must be changed to its Latin alphabet equivalent and placed at the end of the gene symbol. This permits alphabetical ordering of the gene in listings with similar properties, such as substrate specificities e.g. GLA "galactosidase, alpha"; GLB "galactosidase, beta".
No punctuation may be used, with the exception of the HLA, immunoglobulin and T cell receptor gene symbols (which may be hyphenated). The HLA symbols are assigned by the WHO Nomenclature Committee for Factors of the HLA System (Marsh et al 2001) via the IMGT/HLA database (Robinson et al. 2001). The immunoglobulin and T cell receptor gene symbols are assigned by the IMGT Nomenclature Committee via the IMGT/LIGM database (Lefranc 2001).
Gene symbols will not usually be assigned to alternative transcripts. However, if a community working on a group of genes needs nomenclature for multiple small coding sequences that can be combined to form a number of different larger products, these coding sequences may be assigned symbols e.g. PCDHA1 "protocadherin alpha 1" and PCDHA2 "protocadherin alpha 2". Similarly, UGT1A1 "UDP glycosyltransferase 1 family, polypeptide A1" to UGT1A13 are 13 distinct gene symbols because there are 13 distinct promoter/first exons of the UDP glucuronosyltransferase-1 gene, spanning >500 kb, in which about one-half of the enzyme (including catalytic specificity) is encoded by exons 1 and the other half by exons 2 through 5.
Tissue specificity or molecular weight designations should be avoided; where necessary they may be included in the gene name.
Some letters or combination of letters are used as prefixes or suffixes in a symbol to give a specific meaning and their use for other meanings should be avoided (Section 10).
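
Several of the symbol rules in this section are purely syntactic and can be checked mechanically. The following Python sketch is illustrative only: the function name `symbol_issues`, its messages and the `max_len` parameter are assumptions, the C#orf# pattern assumes that X and Y are valid chromosome designations, and the hyphenated HLA, immunoglobulin and T cell receptor symbols are deliberately not handled. It does not check uniqueness against Genew or MGD, nor the reserved prefixes and suffixes of Section 10.

```python
import re
from typing import List

# Assumed pattern for the C#orf# exception (X and Y are an assumption).
_CORF = re.compile(r"C(?:\d{1,2}|X|Y)orf\d+")


def symbol_issues(symbol: str, max_len: int = 6) -> List[str]:
    """Return the syntactic guideline issues found in a proposed symbol."""
    if _CORF.fullmatch(symbol):
        return []  # C#orf# symbols are the stated lower-case exception
    issues = []
    if not re.match(r"[A-Za-z]", symbol):
        issues.append("initial character should be a letter")
    if not re.fullmatch(r"[A-Z0-9]+", symbol):
        issues.append("use only upper-case Latin letters and Arabic numerals")
    if len(symbol) > max_len:
        issues.append(f"ideally no longer than {max_len} characters")
    return issues


for candidate in ["COX8", "C1orf1", "5HT", "UGT1A13"]:
    print(candidate, symbol_issues(candidate))
```

The length check is advisory rather than a hard failure, since the guidelines say "ideally" and approved symbols such as UGT1A13 exceed six characters.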

3. Gene names

Gene names should be brief and specific and should convey the character or function of the gene, but should not attempt to describe everything known about it. The first letter of the symbol should be the same as that of the name in order to facilitate alphabetical listing and grouping. Gene names are written using American spelling. Tissue specificity and molecular weight designations should be avoided as they have only limited use as a description and may in time and across species prove inaccurate; however, they may be incorporated into the gene name if absolutely necessary. Gene names should not include terms such as nephew, cousin, sister etc. to describe familial relationships with other genes. The following gene name syntax should be used:

Names start with a lower case letter unless it is a person's name describing a disease/phenotype e.g. AHDS "Allan-Herndon-Dudley syndrome".
Descriptive modifiers should follow the main part of the name, separated by commas e.g. ACO1 "aconitase 1, soluble".
Where a complete alternative name (or names) is being included as part of the name, this should be in parentheses e.g. IDS "iduronate 2-sulfatase (Hunter syndrome)".
Names of other species must be placed in parentheses at the end e.g. LFNG "lunatic fringe homolog (Drosophila)".

4. Gene Families

Hierarchical symbols for both structural and functional gene families should be used where possible. A stem (or root) symbol as a basis for a symbol series allows easy identification of other family members in both database searches and the literature. Gene family members should be designated by Arabic numerals placed immediately after the gene stem symbol, without any space between the letters and numbers used e.g. GPR1, GPR2, GPR3 (three G protein-coupled receptor genes). However, very occasionally, only if they exist historically, single-letter suffixes may be used to designate these different genes e.g. LDHA, LDHB, LDHC (three lactate dehydrogenase genes). Some gene families are very large and so may include a variety of number/letter combinations to indicate their phylogenetic relationships e.g. CYP1A1, CYP21A2, CYP51A1 (three members of the cytochrome P450 superfamily). When symbols are approved by the HGNC consecutively in a gene family the hierarchical order of symbols will not necessarily reflect the chronological order of peer-reviewed publications. Consecutive symbols approved by HGNC take precedence over those published, although this will be a matter for discussion with the relevant scientific community.

Many genes receive approved symbols and names which are non-ideal when considered in the light of subsequent information. In the case of individual genes a change to the name (and subsequently the symbol) is only made if the original name is seriously misleading. However, groups of scientists working on particular gene families often co-ordinate a revised nomenclature when more information becomes available; such initiatives are welcomed. Groups planning to do this are strongly advised to liaise with the HGNC in order to avoid possible problems. Previously approved symbols which have been withdrawn should not be re-used as this causes great confusion in information retrieval.

Anonymous families
When a series of genes can be shown to be related by sequence similarity but otherwise cannot be described by homology or function they can be assigned anonymous and temporary FAM# symbols. Membership in each family, indicated by the FAM number, will be assigned using the general criteria established by Mackenzie et al. (1997) and Nelson et al. (1996): that the gene products show greater than 40% amino acid identity. However, these criteria are not exclusive but merely a convenient starting point on which to base an appropriate symbol. The symbol will always include a letter designation for sub-family, with the final character indicating gene number e.g. FAM7A1 "family with sequence similarity 7, member A1" etc. Once further information is available to assign a more descriptive name the whole gene family should be updated with new symbols; the FAM designations being withdrawn.
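
The structure of a FAM symbol (family number, sub-family letter, gene number) can be illustrated with a short Python sketch. The names `FamSymbol`, `parse_fam` and `fam_name` are hypothetical, and the pattern assumes a single-letter sub-family designation, as in the FAM7A1 example.

```python
import re
from typing import NamedTuple


class FamSymbol(NamedTuple):
    family: int     # the FAM number
    subfamily: str  # letter designating the sub-family
    member: int     # final gene number


# Assumed layout: FAM, family number, one sub-family letter, gene number.
_FAM = re.compile(r"FAM(\d+)([A-Z])(\d+)")


def parse_fam(symbol: str) -> FamSymbol:
    """Split a FAM-style symbol into its three components."""
    match = _FAM.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a FAM-style symbol: {symbol!r}")
    return FamSymbol(int(match.group(1)), match.group(2), int(match.group(3)))


def fam_name(symbol: str) -> str:
    """Build the descriptive name that corresponds to a FAM-style symbol."""
    fam = parse_fam(symbol)
    return (f"family with sequence similarity {fam.family}, "
            f"member {fam.subfamily}{fam.member}")


print(parse_fam("FAM7A1"))
print(fam_name("FAM7A1"))
```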

5. Homologies with other species

Homologous genes in different vertebrate species (orthologs) should where possible have the same gene nomenclature.

Human homologs of genes first identified in other species should not be designated by a symbol beginning with H or h for human.
When a gene or series of genes has been defined in one species, and it is reasonable to expect that in the future a homologous gene will be identified in man, we recommend that the designated symbol be reserved for the human locus. We recommend that this should be done in other species for genes first identified in human.
When necessary to distinguish the species of origin for homologous genes with the same gene symbol, the letter-based code for different species already established by SWISS-PROT should be used. This can be found at URL <http://www.expasy.ch/cgi-bin/speclist> and commonly used species are shown in App 2: Table 2.
The agreement between human and mouse gene nomenclature for many homologous genes should be continued and extended to other vertebrate species where possible.
Human homologs of genes in invertebrate, or prokaryote species may sometimes be represented by the symbol used in the other species, usually followed by an L to represent "like", and a number if there is (or is likely to be) more than one human homolog and the organism designated at the end of the gene name in parentheses e.g. BARHL1 "BarH-like 1 (Drosophila)". The use of H to represent homolog is no longer recommended, and will be discontinued.

6. Genes identified from sequence information

Antisense
A gene of unknown function, encoded at the same genomic locus (with overlapping exons) as another gene should have its own symbol. If the new gene regulates the first gene it may be assigned the symbol of the first gene with the suffix AS for antisense e.g. IGF2AS "insulin-like growth factor 2, antisense". The gene symbol should not be written backwards.

Opposite Strand
Genes of unknown function on the opposite strand, which have no proven regulatory function, should be assigned the suffix OS for "opposite strand".

Untranslated Functional RNAs
These may be assigned symbols that are unique and relevant to the scientific community. However, the approved name should contain "untranslated RNA" e.g. H19 "H19, imprinted maternally expressed untranslated RNA".

Related (-like) sequences
The designation of the suffix "L" for like has been used in the past for related sequences identified by cross-hybridisation studies. Where genes are identified by database searching and where no other functional information is available and there is some sequence similarity with a known gene they are designated with the symbol of the known gene followed by an "L" for like e.g. ACY1L "aminoacylase 1-like". Alternatively, if there are a number of similar genes they may be assigned numbers in a series e.g. BTNL1 "butyrophilin-like 1" to BTNL3.

Genes of unknown function
Genes predicted from EST clusters or from genomic sequence with EST evidence, but showing no structural or functional homology, are regarded as putative. These are designated by the chromosome of origin, the letters orf for open reading frame and a number in a series e.g. C2orf1, "chromosome 2 open reading frame 1". The use of the lower case letters "orf" is to prevent confusion between the first letter "o" and the numeral "0" (zero), which may be part of the chromosome number.
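
As an illustration of the C#orf# convention, the Python sketch below derives the gene name from a symbol. The function name `corf_name` is hypothetical, and accepting X and Y as chromosome designations is an assumption, since the examples here only show numbered chromosomes.

```python
import re

# Assumed pattern: C, chromosome, lower-case "orf", series number.
_CORF = re.compile(r"C(\d{1,2}|X|Y)orf(\d+)")


def corf_name(symbol: str) -> str:
    """Return the descriptive name for a C#orf# symbol."""
    match = _CORF.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a C#orf# symbol: {symbol!r}")
    chromosome, number = match.groups()
    return f"chromosome {chromosome} open reading frame {int(number)}"


print(corf_name("C2orf1"))
```

Because the pattern requires lower-case "orf", a symbol such as C2ORF1 is rejected, which mirrors the reason given above for using lower case.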

Pseudogenes
Pseudogenes are sequences that are generally untranscribed and untranslated and which have high homology to identified genes. However, it has recently been shown that in different organisms or tissues functional activation may occur. Therefore, the previous policy of assigning the gene symbol of the structural gene followed by "P" and a number will only be approved on a case by case basis. In future, pseudogenes will usually be assigned the next number in the relevant symbol series, suffixed by a "P" for pseudogene if requested e.g. OR5B12P "olfactory receptor, family 5, subfamily B, member 12 pseudogene". However, the designation pseudogene will remain in the gene name.

7. Enzymes and proteins

The rules described in the sections on gene names and symbols apply, but in addition the names of genes coding for enzymes should be based on those recommended by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology e.g. FPGS "folylpolyglutamate synthase". These can be found at URL <http://ca.expasy.org/enzyme/>. Names of genes encoding plasma proteins, hemoglobins, and specialised proteins are based on standard names and those recommended by their respective committees e.g. HBA1 "hemoglobin, alpha 1".

8. Clinical disorders

The first gene symbol allocated to an inherited clinical phenotype (monogenic Mendelian inheritance) may be based on an acronym which has been established as a name for the disorder, whilst following the rules described previously e.g. ACH for "achondroplasia". It is usual for this symbol to change when the gene product or function is identified; however, if there is no additional information derived from the cloned gene, the disease symbol e.g. ACH will be maintained. If an approved gene symbol for the cloned gene based on product or function already exists this will take precedence over the symbol derived from the clinical disorder when the gene descriptions are merged. For example, in the case of achondroplasia the symbol ACH was withdrawn and FGFR3 and the name "fibroblast growth factor receptor 3 (achondroplasia, thanatophoric dwarfism)" approved in its place. If the original symbol is to be maintained we do not recommend the sole use of invented names based upon the phenotype e.g. tuberin for TSC2 "tuberous sclerosis 2".

Complex/polygenic traits
Genome searches may suggest a contributing locus in a complex trait, which may for convenience be given a gene symbol e.g. IDDM3 "insulin-dependent diabetes mellitus 3", although a proportion of these will disappear in time. A symbol allocated to such a locus will not be re-used.

Contiguous gene syndromes
Syndromes clearly associated with multiple widely dispersed loci should not be given gene symbols. Syndromes associated with a regional deletion or duplication may be assigned the letters CR (for chromosome region), in place of S for syndrome. Examples include ANCR "Angelman syndrome chromosome region", CECR "cat eye syndrome chromosome region". However, as advances in database design have increased the possible ways of representing this type of information, we recommend that such symbols are classified as syndromic region symbols and not gene symbols. Candidate genes found within a chromosomal region may be assigned a symbol based on the syndromic region symbol with sequential numbering e.g. CECR1 "cat eye syndrome chromosome region, candidate 1".

Loss of heterozygosity
A chromosomal region in which the existence of genes may be inferred by loss of heterozygosity can be designated by a symbol consisting of the letters LOH for loss of heterozygosity, the chromosome number, CR (for chromosomal region) and then a sequential number e.g. LOH1CR1 "loss of heterozygosity, 1, chromosomal region 1".
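
The four-part composition of a loss of heterozygosity symbol can be sketched as follows. The function names `loh_symbol` and `loh_name` are hypothetical, and the chromosome is passed as text because the guideline does not say here whether designations other than numbers are used.

```python
def loh_symbol(chromosome: str, region: int) -> str:
    """Compose LOH + chromosome + CR + sequential number."""
    if region < 1:
        raise ValueError("the sequential number should start at 1")
    return f"LOH{chromosome}CR{region}"


def loh_name(chromosome: str, region: int) -> str:
    """Compose the descriptive name matching an LOH symbol."""
    return f"loss of heterozygosity, {chromosome}, chromosomal region {region}"


print(loh_symbol("18", 1), "=", loh_name("18", 1))
```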

9. Genomic Rearrangements and Features

Genes only found within subsets of the population
Symbols will be assigned to genes generated by recombination events. The symbols will be designated a number or letter in the series determined from the original gene symbol. For example, where the KIR2DL5 gene is duplicated in some haplotypes, the original gene will be referred to as KIR2DL5A "killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 5A" and the duplication as KIR2DL5B "killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 5B". Information detailing position on the chromosome e.g. telomeric, may be added to the gene name to aid understanding. These types of symbols will be agreed on an individual basis.

Gene recombination
When there has been recombination between genes, generating a new locus, the gene symbol should remain short but reflect the two (or more) initial gene symbols. The name may be as descriptive as necessary e.g. POMZP3 "POM (POM121 rat homolog) and ZP3 fusion".

Functional Repeat sequences
Sequence data for these records must be obtained by HGNC prior to symbol approval. Endogenous retroviruses have the stem symbol ERV, followed by a letter representing type and the next consecutive number in the series e.g. ERVK2 "endogenous retroviral sequence K(C4), 2". Transposable elements have the stem symbol TE, followed by a relevant character set e.g. MAR, and the next consecutive number in the series. Thus, a mariner-type transposable element could be TEMAR1 "transposable element mariner-type, 1". LINEs (Long Interspersed Nuclear Elements) are currently assigned the next consecutive number in the LRE series e.g. LRE1 "LINE retrotransposable element 1".

10. Characters reserved for specific usage

Letters, or combinations of letters, are used as prefixes or suffixes in a symbol to represent a specific meaning and so their use for other meanings should be avoided where possible (Table 1). The following letter usage is not acceptable: H or h for human, G or g for gene. If the name of a gene contains a character or property for which there is a recognized abbreviation, the abbreviation should be used e.g. the single-letter abbreviation for amino acids used in aminoacyl residues (App 2: Table 3). Other characters are also used within the symbol to represent genomic or genetic information (Table 1).

Table 1: Characters reserved for specific usage

Character	Meaning
AS	antisense
AP	associated/accessory protein
BP	binding protein
C	catalytic
CR	chromosome/critical region
C#orf#	chromosome # open reading frame #
D or DC	domain containing
FAM	family with sequence similarity
N or NH	inhibitor
IP	interacting protein
IT	intronic transcript
LG	ligand
L	like
LOH	loss of heterozygosity
MT or M	mitochondrial
OS	opposite strand
OT	overlapping transcript
P	pseudogene
Q	quantitative trait
R	receptor
RG	regulator
@	gene cluster in chromosomal region
#	gene family

11. Symbol Status

Each gene record in the Genew database has one of five states:

Pending: gene symbols in progress.
Approved: official gene symbols that are publicly available.
Reserved: official gene symbols that are not publicly available; these will be released after a maximum period of six months.
Symbol Withdrawn: previous symbols for genes which now have different approved symbols.
Entry Withdrawn: for symbols which refer to a gene that has since been shown not to exist.
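
The five states could be modelled as an enumeration, as in the Python sketch below. The class name `SymbolStatus` and the helper `is_official` are assumptions for illustration and do not describe how the Genew database actually stores status.

```python
from enum import Enum


class SymbolStatus(Enum):
    PENDING = "Pending"
    APPROVED = "Approved"
    RESERVED = "Reserved"
    SYMBOL_WITHDRAWN = "Symbol Withdrawn"
    ENTRY_WITHDRAWN = "Entry Withdrawn"


def is_official(status: SymbolStatus) -> bool:
    """Approved and Reserved are the two states described as official."""
    return status in (SymbolStatus.APPROVED, SymbolStatus.RESERVED)


print(SymbolStatus("Reserved"), is_official(SymbolStatus("Reserved")))
```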

Appendix 1: Gene symbol use in publications

It is recommended that gene and allele symbols are underlined in manuscript and italicized in print. Italics need not be used in gene catalogs. To distinguish between mRNA, genomic DNA and cDNA the relevant prefix should be written in parentheses (mRNA)RBP1, (gDNA)RBP1, (cDNA)RBP1.

DNA segments
These symbols can be obtained from The Genome Database (GDB) and are assigned automatically to arbitrary DNA fragments and loci. Please email requests to help@gdb.org. These symbols comprise five parts e.g. DXS9879E, described by the following guidelines:
Part 1: D for DNA
Part 2: 0, 1, 2,...22, X, Y, XY for the chromosomal assignment, where XY is for segments homologous on the X and Y chromosomes, and 0 is for unknown chromosomal assignment.
Part 3: S, Z or F indicating the complexity of the DNA segment detected by the probe; with S for a unique DNA segment, Z for repetitive DNA segments found at a single chromosome site and F for small undefined families of homologous sequences found on multiple chromosomes.
Part 4: 1, 2, 3,..., a sequential number to give uniqueness to the above concatenated characters.
Part 5: When the DNA segment is known to be an expressed sequence the suffix E can be added to indicate this.
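
The five parts listed above map naturally onto a regular expression. The Python sketch below splits a D-number into its components; the names `DSegment`, `COMPLEXITY` and `parse_d_number` are hypothetical and this is not the code GDB uses to assign or validate these symbols.

```python
import re
from typing import NamedTuple


class DSegment(NamedTuple):
    chromosome: str   # "0" to "22", "X", "Y" or "XY"
    complexity: str   # "S", "Z" or "F"
    number: int       # sequential number
    expressed: bool   # True when the E suffix is present


# Meaning of the Part 3 letters, paraphrased from the guideline text.
COMPLEXITY = {
    "S": "unique DNA segment",
    "Z": "repetitive DNA segments at a single chromosome site",
    "F": "small undefined family of homologous sequences on multiple chromosomes",
}

# XY is listed before X and Y so that it is tried first.
_DSEG = re.compile(r"D(XY|X|Y|\d{1,2})([SZF])(\d+)(E?)")


def parse_d_number(symbol: str) -> DSegment:
    """Split a D-number such as DXS9879E into its five parts."""
    match = _DSEG.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a D-number: {symbol!r}")
    chromosome, complexity, number, suffix = match.groups()
    if chromosome.isdigit() and int(chromosome) > 22:
        raise ValueError(f"chromosome out of range: {chromosome}")
    return DSegment(chromosome, complexity, int(number), suffix == "E")


for example in ["DXS9879E", "DXYS155E"]:
    parsed = parse_d_number(example)
    print(example, parsed, COMPLEXITY[parsed.complexity])
```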

Alleles
Allele terminology is the responsibility of the HUGO Mutation Database Initiative (now evolved into the Human Genome Variation Society) <http://ariel.ucs.unimelb.edu.au/~cotton/mdi.htm>. However, alleles should still be limited to an optimum of three characters using only capital letters or Arabic numerals. The allele designation should be written on the same line as gene symbol separated by an asterisk e.g. PGM1*1, the allele is printed as *1. When only allele symbols are displayed an asterisk should precede them. Sequence variation nomenclature, in the form of mutations and polymorphisms, is described by Antonarakis et al (1998) and den Dunnen and Antonarakis (2001). HLA alleles are assigned by the WHO Nomenclature Committee for Factors of the HLA System (Marsh et al 2001) via the IMGT/HLA database (Robinson et al. 2001). Immunoglobulin and T cell receptor alleles are assigned by the IMGT Nomenclature Committee via the IMGT/LIGM database (Lefranc 2001).
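
A small Python sketch of the allele notation follows. The function name `format_allele` is hypothetical, and the sketch treats the three-character optimum as a hard limit, which is a simplifying assumption.

```python
import re

# Assumption: one to three capital letters or Arabic numerals.
_ALLELE = re.compile(r"[A-Z0-9]{1,3}")


def format_allele(allele: str, gene_symbol: str = "") -> str:
    """Write an allele after an asterisk, optionally preceded by the gene symbol."""
    if not _ALLELE.fullmatch(allele):
        raise ValueError(f"unexpected allele designation: {allele!r}")
    return f"{gene_symbol}*{allele}"


print(format_allele("1", "PGM1"))  # gene symbol and allele on the same line
print(format_allele("1"))          # allele displayed on its own
```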

Splice variants
The HGNC has no authority over protein nomenclature; however, we are frequently asked how to designate splice variants so we suggest the following:
Proteins should be designated using the same symbol as the gene, printed in non-italicized letters. When referring to splice variants, the symbol can be followed by an underscore and the lower case letter "v" then a consecutive number to denote which variant is which e.g. G6PD_v1.

Species designations
When necessary to distinguish the species of origin for homologous genes with the same gene symbol, the letter-based code for different species already established by SWISS-PROT should be used. This can be found at URL <http://www.expasy.ch/cgi-bin/speclist> and commonly used species are shown in App 2: Table 2. The code is for use in publications only and should not be incorporated as part of the gene symbol. The species designation is added as a prefix to the gene symbol in parentheses. For example HUMAN signifies Homo sapiens and MOUSE signifies Mus musculus e.g. human genes: (HUMAN)ABCA1; (HUMAN)SLC13A2; homologous mouse genes: (MOUSE)Abca1; (MOUSE)Slc13a2.
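
The species prefix can be applied programmatically, as in the Python sketch below. The dictionary holds only a few of the codes listed in App 2: Table 2, and the names `SPECIES_CODES` and `with_species` are hypothetical.

```python
# A few SWISS-PROT species codes taken from App 2: Table 2.
SPECIES_CODES = {
    "HUMAN": "Homo sapiens",
    "MOUSE": "Mus musculus",
    "RAT": "Rattus norvegicus",
    "DROME": "Drosophila melanogaster",
}


def with_species(code: str, symbol: str) -> str:
    """Prefix a gene symbol with its species code in parentheses."""
    if code not in SPECIES_CODES:
        raise KeyError(f"unknown species code: {code}")
    return f"({code}){symbol}"


print(with_species("HUMAN", "ABCA1"))
print(with_species("MOUSE", "Abca1"))
```

The prefix is added only at display time, which keeps the stored symbol free of any species reference, in line with the rule that the code should not be incorporated into the gene symbol.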

Appendix 2: Tables

App 2: Table 1: Greek-to-Latin alphabet conversion

Greek Symbol	Greek	Latin upper case conversion
α	alpha	A
β	beta	B
γ	gamma	G
δ	delta	D
ε	epsilon	E
ζ	zeta	Z
η	eta	H
θ	theta	Q
ι	iota	I
κ	kappa	K
λ	lambda	L
μ	mu	M
ν	nu	N
ξ	xi	X
ο	omicron	O
π	pi	P
ρ	rho	R
σ	sigma	S
τ	tau	T
υ	upsilon	Y
φ	phi	F
χ	chi	C
ψ	psi	U
ω	omega	W
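
The conversion table above, together with the rule in Section 2 that a Greek letter is placed at the end of the symbol, can be sketched in Python. The mapping is copied from the table; the function name `symbol_with_greek_suffix` and the stem "GL" used in the example are assumptions for illustration.

```python
# Greek letter name -> Latin upper-case conversion, as given in the table.
GREEK_TO_LATIN = {
    "alpha": "A", "beta": "B", "gamma": "G", "delta": "D", "epsilon": "E",
    "zeta": "Z", "eta": "H", "theta": "Q", "iota": "I", "kappa": "K",
    "lambda": "L", "mu": "M", "nu": "N", "xi": "X", "omicron": "O",
    "pi": "P", "rho": "R", "sigma": "S", "tau": "T", "upsilon": "Y",
    "phi": "F", "chi": "C", "psi": "U", "omega": "W",
}


def symbol_with_greek_suffix(stem: str, greek_name: str) -> str:
    """Append the Latin equivalent of a Greek letter to the end of a stem."""
    return stem + GREEK_TO_LATIN[greek_name.lower()]


# "GL" is an assumed stem standing for galactosidase.
print(symbol_with_greek_suffix("GL", "alpha"))
print(symbol_with_greek_suffix("GL", "beta"))
```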

App 2: Table 2: Species Abbreviations from URL: <http://www.expasy.ch/cgi-bin/speclist>

Common name	Abbreviation	Species
cat	FELSI	Felis silvestris catus
chicken	CHICK	Gallus gallus
chimp	PANTR	Pan troglodytes
dog	CANFA	Canis familiaris
fruit fly	DROME	Drosophila melanogaster
human	HUMAN	Homo sapiens
mouse	MOUSE	Mus musculus
orangutan	PONPY	Pongo pygmaeus
pig	PIG	Sus scrofa
puffer fish	FUGRU	Fugu rubripes
puffer fish	TETFL	Tetraodon fluviatilis
rabbit	RABIT	Oryctolagus cuniculus
rat	RAT	Rattus norvegicus
worm	CAEEL	Caenorhabditis elegans
yeast	YEAST	Saccharomyces cerevisiae
yeast	SCHPO	Schizosaccharomyces pombe
zebrafish	BRARE	Brachydanio rerio

App 2: Table 3: Single-letter amino acid symbols

Amino acid	Three-letter symbol	One-letter symbol
Alanine	Ala	A
Arginine	Arg	R
Asparagine	Asn	N
Aspartic acid	Asp	D
Asn +Asp	Asx	B
Cysteine	Cys	C
Glutamine	Gln	Q
Glutamic acid	Glu	E
Gln + Glu	Glx	Z
Glycine	Gly	G
Histidine	His	H
Isoleucine	Ile	I
Leucine	Leu	L
Lysine	Lys	K
Methionine	Met	M
Phenylalanine	Phe	F
Proline	Pro	P
Serine	Ser	S
Threonine	Thr	T
Tryptophan	Trp	W
Tyrosine	Tyr	Y
Valine	Val	V

References

Antonarakis SE. Recommendations for a nomenclature system for human gene mutations. Nomenclature Working Group. Hum Mutat. 11(1):1-3. 1998

den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: A discussion. Hum Mutat. 15(1):7-12. 2000

Lander ES, et al. The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 409(6822):860-921. 2001.

Lefranc MP. IMGT, the international ImMunoGeneTics database. Nucleic Acids Res. 29: 207-209. 2001.

Mackenzie PI, Owens IS, Burchell B, Bock KW, Bairoch A, Belanger A, Fournel-Gigleux S, Green M, Hum DW, Iyanagi T, Lancet D, Louisot P, Magdalou J, Chowdhury JR, Ritter JK, Schachter H, Tephly TR, Tipton KF, Nebert DW. The UDP glycosyltransferase gene superfamily: recommended nomenclature update based on evolutionary divergence. Pharmacogenetics. 7(4):255-69. 1997.

McAlpine PJ. International system for human gene nomenclature. Trends in Genetics 39-42 1995.

Marsh SGE, Bodmer JG, Albert ED, Bontrop RE, Bodmer WF, Dupont B, Erlich HA, Hansen JA, Mach B, Mayr WR, Parham P, Petersdorf EW, Sasazuki T, Schreuder GM, Strominger J, Svejgaard A, Terasaki PI. Nomenclature for factors of the HLA System, 2000. Tissue Antigens 57 236-283. 2001.

Nelson DR, Koymans L, Kamataki T, Stegeman JJ, Feyereisen R, Waxman DJ, Waterman MR, Gotoh O, Coon MJ, Estabrook RW, Gunsalus IC, Nebert DW. P450 superfamily: update on new sequences, gene mapping, accession numbers and nomenclature. Pharmacogenetics. 6(1):1-42. 1996.

Robinson J, Waller MJ, Parham P, Bodmer JG, and Marsh SGE. IMGT/HLA Database - a sequence database for the human major histocompatibility complex. Nucleic Acids Res. 29: 210-213, 11. 2001.

Shows, T.B., McAlpine, P.J., Boucheix, C., Collins, F.S., Conneally, P.M., Frezal, J., Gershowitz, H., Goodfellow, P.N., Hall, J.G., Issitt, P., Jones, C.A., Knowles, B.B., Lewis, M., McKusick, V.A., Meisler, M., Morton, N.E., Rubinstein, P., Schanfield, M.S., Schmickel, R.D., Skolnick, M.H., Spence, M.A., Sutherland, G.R., Traver, M., Van Cong, N., and Willard, H.F. Guidelines for human gene nomenclature: An international system for human gene nomenclature (ISGN, 1987). Cytogenet Cell Genet. 46(1-4):11-28. 1987.

Shows, T.B., Alper, C.A., Bootsma, D., Dorf, M., Douglas, T., Huisman, T., Kit, S., Klinger, H.P., Kozak, C., Lalley, P.A., Lindsley, D., McAlpine, P.J., McDougall, J.K., Meera Khan, P., Meisler, M., Morton, N.E., Opitz, J.M., Partridge, C.W., Payne, R., Roderick, T.H., Rubinstein, P., Ruddle, F.H., Shaw, M., Spranger, J.W., and Weiss, K. International System for Human Gene Nomenclature (1979). Cytogenet Cell Genet 25: 96-116, 1979.

Venter et al. The sequence of the human genome. Science. 291(5507):1304-51. 2001.

White J, et al. Networking nomenclature. Nat Genet. 18(3): 209. 1998.

White JA, et al. Guidelines for human gene nomenclature. HUGO Nomenclature Committee. Genomics. 45(2):468-471. 1997.

Maintained by hgnc@genenames.org
Last updated: 27 Nov 2001