BioMart help
The HGNC BioMart application allows users to create customised data tables without the need for any programming knowledge by interacting with a form to filter the data and select the columns/attributes they want within the table. This page details how to interact with the BioMart Mart form and provides definitions of the filters and attributes.
BioMart overview & project
BioMart is a generic data management system which offers a range of advanced query interfaces and administration tools.
The system comes with built-in support for query optimisation and database federation. BioMart provides users with the ability to conduct fast, powerful queries using web, graphical, or text-based applications, or programmatically using web services or software libraries written in Perl and Java. For data providers, the system simplifies the task of integrating their own data with other datasets hosted on the network.
All the software, including an easy-to-install BioMart website, is available for local installation. BioMart software is completely open source, licensed under the LGPL, and freely available to anyone without restrictions.
For more information about the BioMart project and to download the code visit the BioMart site.
HGNC Marts
The HGNC BioMart homepage provides a list of the HGNC Marts that are available to use. Clicking on a Mart name takes the user to the mart form for the chosen dataset. So far there are two marts to choose from: a gene mart for gene-symbol-centric data and a family mart for gene-group-centric data.
All the mart forms share the same template, in which the form is split into three parts: Datasets, Filters and Attributes.
Datasets
The Datasets part of the mart form lets the user select the database and the dataset they would like to query and download. The HGNC has only one database, so the database dropdown can be ignored. A user who has entered the site via the HGNC BioMart homepage will not need to change the dataset. However, a user who changes their mind and wants to download data from another dataset can select a different one using the "Datasets" dropdown box, which will change the form. As already mentioned, there are two datasets to choose from so far: the gene dataset and the family/group dataset.
Filters
The filters section is an area for the user to filter the data by the provided fields. There are several types of filter for the user to interact with, the most common type being the text input filter. The filters are split into subsections, according to the type of field/data they filter. Filters are not required for a BioMart search. If a user wants to select attributes for all the data in the dataset they should ignore this section of the form.
- Text input filters
- Text input filters usually allow the user to add a wildcard "%" symbol to allow BioMart to search the field for data that is like the filter query. Text input filters have [wildcard = %] displayed in the label if wildcards are allowed.
- Select box filters
- Select box filters are easy to use in that all the user has to do is click on the filter and select the value to filter by. By default the filter will say "-- Select --" and by leaving it like this BioMart will ignore the filter.
- Multiple select filters
- Multiple select filters are scroll boxes that contain many values per line. To filter by a particular value the user can click on that value. To filter on several values, Windows users should hold down the control (ctrl) key and click on each additional value. Mac users need to hold down the command (cmd) key instead.
- Bulk upload filters
- Our Mart forms also have bulk upload filters. The user first selects the field they would like to query multiple times by selecting a value within the drop down select box. The user can then place their values within the text area box or click the "upload file" link to select a file which contains the query values. All of the values have to be of the type selected within the drop down (i.e. a user cannot provide a file or type in values that contain mixed ID/symbol/accession types).
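The behaviour of the "%" wildcard in text input filters can be illustrated with a short Python sketch. It assumes that "%" matches any run of zero or more characters (as in an SQL LIKE pattern) and that matching is case-sensitive; neither detail is confirmed by this page, and the function names are hypothetical.

```python
import re


def wildcard_to_regex(query):
    """Translate a filter query containing '%' into a compiled regex.

    Assumption: '%' stands for any run of zero or more characters.
    Every other character is matched literally.
    """
    literal_parts = (re.escape(part) for part in query.split("%"))
    return re.compile(".*".join(literal_parts))


def matches(query, value):
    """Return True if the whole value matches the wildcard query."""
    return wildcard_to_regex(query).fullmatch(value) is not None


symbols = ["BRCA1", "BRCA2", "BRAF", "TP53"]
print([s for s in symbols if matches("BRCA%", s)])  # ['BRCA1', 'BRCA2']
```

Without a wildcard the query has to equal the whole value in this sketch, so "BRCA" on its own would not match "BRCA1".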
Gene filters
- HGNC data filter
-
- Approved symbol
- The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.
- Approved name
- The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.
- Locus group
- Groups locus types together into related sets. Below is a list of groups and the locus types within the group:
- protein-coding gene - contains the "gene with protein product" locus type
- non-coding RNA - contains the following locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
- pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
- phenotype - contains the "phenotype only" locus type
- other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
- withdrawn - contains the "withdrawn" locus type only
- Locus type
- Specifies the type of locus described by the given entry:
- complex locus constituent - transcriptional unit that is part of a named complex locus
- endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
- fragile site - a heritable locus on a chromosome that is prone to DNA breakage
- gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
- immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460)
- immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- phenotype only - mapped phenotypes (SO:0001500)
- pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
- readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
- region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
- RNA, cluster - region containing a cluster of small non-coding RNA genes
- RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
- RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
- RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
- RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
- RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
- RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
- RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
- RNA, small misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
- T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460)
- T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
- unknown - entries where the locus type is currently unknown
- virus integration site - target sequence for the integration of viral DNA into the genome
- Bulk upload filter
-
- Filter by ID, accession or symbol
- This field allows the user to provide multiple query values to bulk search BioMart. The list of values must all be of the type selected using the drop down box. Values can be typed/pasted into the text area or uploaded within a file by clicking on the "upload file" link. The types accepted in this filter are as follows:
- HGNC ID(s) - A unique ID provided by the HGNC for each gene with an approved symbol. IDs are of the format HGNC:n, where n is a unique number.
- Approved symbols - The official gene symbol that has been approved by the HGNC.
- Alias gene symbols - Other symbols used to refer to the gene.
- Previous HGNC symbols - Symbols previously approved by the HGNC for the gene.
- CCDS accessions - The Consensus CDS (CCDS) accession.
- INSDC (ENA/GenBank/DDBJ) accessions - INSDC nucleotide sequence accession numbers.
- Ensembl gene ID(s) - The ID for an Ensembl gene entry.
- Mouse genome informatics (MGI) ID(s) - Mouse Genome Informatics ID for a mouse homolog of human genes.
- NCBI Gene ID(s) - IDs associated with a gene in NCBI Gene.
- OMIM ID(s) - Identifier from the Online Mendelian Inheritance in Man (OMIM).
- Orphanet ID(s) - The Orphanet ID identifies a gene within Orphanet and the rare diseases that are associated with the gene.
- Pseudogene.org ID(s) - An ID for a pseudogene entry/sequence within the Pseudogene.org database.
- RefSeq accessions - The Reference Sequence (RefSeq) identifier.
- Rat Genome Database (RGD) ID(s) - Rat Genome Database ID for a rat homolog of human genes.
- UniProt accessions - The UniProt identifier for a protein product of a gene.
- Vega gene ID(s) - The Vega gene ID.
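Because a bulk upload list must contain values of a single type, it can help to check a list before pasting or uploading it. The sketch below does this for HGNC IDs only, using the HGNC:n format described above. It assumes that n consists of digits only and that values are separated by whitespace or line breaks; the function name is hypothetical.

```python
import re

# HGNC IDs have the form "HGNC:n"; n is assumed here to be digits only.
HGNC_ID = re.compile(r"HGNC:\d+")


def non_hgnc_ids(values):
    """Return the values that do not look like HGNC IDs."""
    return [v for v in values if HGNC_ID.fullmatch(v) is None]


# Assumption: one value per line (any whitespace separates values).
pasted = "HGNC:5\nHGNC:1100\nBRCA2\n"
values = pasted.split()
bad = non_hgnc_ids(values)
print(bad)  # ['BRCA2']
```

Here the approved symbol BRCA2 is reported because it is mixed into a list of HGNC IDs; it would belong in a separate query with the approved symbol type selected.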
Family filters
- HGNC data filter
- Bulk upload filter
-
- Filter by IDs or symbols
- This field allows the user to provide multiple family/group IDs, HGNC (gene) IDs and approved gene symbols to BioMart to search. The list of values must all be of the type selected using the drop down box. Values can be typed/pasted into the text area or uploaded within a file by clicking on the "upload file" link.
Attributes
The Attributes section of the form is where the user selects what they want displayed within their table for download. BioMart requires at least one attribute to be selected. On both the gene and family marts some of the key attributes are selected by default; however, the user can deselect these defaults. The Attributes section is divided into subsections that group similar attribute fields together. To select or deselect an attribute the user should click on the check box next to the attribute's label. Alternatively the user can select or deselect all the attributes within a subsection by clicking on the links labelled "select all" and "select none".
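The rules for the three parts of the form (a dataset, optional filters, and at least one attribute) can be summarised in a small Python sketch. The MartQuery class and the "gene" dataset label are hypothetical illustrations and are not part of BioMart; the attribute labels are taken from the form.

```python
from dataclasses import dataclass, field


@dataclass
class MartQuery:
    """Hypothetical model of one filled-in mart form."""
    dataset: str
    attributes: list = field(default_factory=list)  # columns of the table
    filters: dict = field(default_factory=dict)     # optional, may stay empty

    def validate(self):
        # Filters may be left out, but at least one attribute is required.
        if not self.attributes:
            raise ValueError("select at least one attribute")


# No filters: the selected attributes are returned for the whole dataset.
query = MartQuery(dataset="gene",
                  attributes=["Approved symbol", "Approved name"])
query.validate()  # passes silently

try:
    MartQuery(dataset="gene").validate()
except ValueError as err:
    print(err)  # select at least one attribute
```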
Gene attributes
- HGNC data
-
- Approved symbol
- The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines.
- Approved name
- The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.
- Locus group
- Groups locus types together into related sets. Below is a list of groups and the locus types within the group:
- protein-coding gene - contains the "gene with protein product" locus type
- non-coding RNA - contains the following locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
- pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
- phenotype - contains the "phenotype only" locus type
- other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
- withdrawn - contains the "withdrawn" locus type only
- Locus type
- Specifies the type of locus described by the given entry:
- complex locus constituent - transcriptional unit that is part of a named complex locus
- endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
- fragile site - a heritable locus on a chromosome that is prone to DNA breakage
- gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
- immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460)
- immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- phenotype only - mapped phenotypes (SO:0001500)
- pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
- readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
- region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
- RNA, cluster - region containing a cluster of small non-coding RNA genes
- RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
- RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
- RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
- RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
- RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
- RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
- RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
- RNA, small misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
- T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460)
- T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
- unknown - entries where the locus type is currently unknown
- virus integration site - target sequence for the integration of viral DNA into the genome
- Model organism databases
-
- Mouse genome informatics (MGI) ID
- Mouse Genome Informatics ID for the mouse homologs of the human gene.
- Rat genome database (RGD) ID
- Rat Genome Database ID for the rat homologs of the human gene.
- Gene resources
-
- NCBI gene ID
- The NCBI gene ID associated with the HGNC gene symbol. NCBI Gene provides curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.
- UCSC gene ID
- The UCSC gene ID associated with the HGNC gene symbol. The ID identifies an annotated human gene record within the UCSC genome browser.
- Vega gene ID
- The Vega gene ID associated with the HGNC gene symbol. The VEGA database is a repository for high-quality gene models produced by the manual annotation of vertebrate genomes.
- Nucleotide resources
-
- CCDS accession
- The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations.
- RefSeq accession
- The Reference Sequence (RefSeq) identifier displayed within the HGNC gene symbol report. RefSeq aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products. RefSeq identifiers are designed to provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.
- Protein resources
-
- UniProt accession
- The UniProt identifier for a protein product of the gene. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases.
- Enzyme (EC) ID
- Enzyme entries have Enzyme Commission (EC) numbers associated with them that indicate the hierarchical functional classes to which they belong.
- Clinical resources
-
- Cosmic symbol
- The gene symbol displayed within the Catalogue Of Somatic Mutations In Cancer (Cosmic). Most of the gene symbols will be the same as the HGNC approved gene symbol, but for some genes in Cosmic this may not be the case.
- OMIM ID
- Identifier provided by Online Mendelian Inheritance in Man (OMIM). This database is described as a catalog of human genes and genetic disorders containing textual information and links to additional related resources.
- Orphanet ID
- Orphanet is the reference portal for information on rare diseases and orphan drugs, for all audiences. Orphanet's aim is to help improve the diagnosis, care and treatment of patients with rare diseases. The Orphanet ID identifies a gene within Orphanet and the rare diseases that are associated with the gene.
- References
-
- PubMed ID
- Identifier that links to published articles relevant to the gene in the NCBI's PubMed database.
- Other external resources
-
- HCDM CD name
- The CD name for a cellular differentiation molecule found within the HCDM database.
- HomeoDB ID
- ID for a homeobox gene within the Homeobox database (HomeoDB2).
- HORDE symbol
- The ID for an olfactory receptor gene entry within the Human Olfactory Receptor Data Exploratorium (HORDE) database.
- IMGT gene symbol
- The IMGT/GENE-DB gene symbol for immunoglobulin and T-cell receptor genes associated with the HGNC gene. The gene symbols are either the same as, or equivalent to, HGNC approved gene symbols. Equivalent IMGT symbols include the character "/", which is not present in HGNC approved symbols. The presence of an IMGT gene symbol indicates that the gene can be found within the IMGT/GENE-DB.
- Intermediate filament database HGNC ID
- The HGNC ID stored within the Human Intermediate Filament Database for an intermediate filament gene.
- IUPHAR/BPS guide to pharmacology ID
- IUPHAR/BPS Guide to PHARMACOLOGY is an expert-driven guide to pharmacological targets and the substances that act on them. The ID is their object ID that is used as an identifier for a gene record within their database.
- mamit-tRNADB ID
- Mamit-tRNAdb is a compilation of mammalian mitochondrial tRNA genes. The ID refers to a tRNA gene within the mamit-tRNAdb database.
- Merops ID
- The MEROPS database is an information resource for peptidases (also termed proteases, proteinases and proteolytic enzymes) and the proteins that inhibit them.
- miRBase accession
- An accession number for a microRNA sequence within the miRBase database for the HGNC gene.
- Pseudogene.org ID
- An ID for a pseudogene entry/sequence within the Pseudogene.org database for the HGNC gene.
- SLC bioparadigms symbol
- The gene symbol for a solute carrier gene as found in the Bioparadigms SLC tables database.
- snoRNABase (snoid) ID
- snoRNABase is a comprehensive database of human H/ACA and C/D box snoRNAs. The ID itself refers to a snoRNA page within the database resource.
- LncRNADB symbol
- lncRNAdb is a database providing comprehensive annotations of eukaryotic long non-coding RNAs (lncRNAs). Most of the gene symbols will be the same as HGNC approved gene symbols; however, for some genes this may not be the case. This resource is now defunct.
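The relationship between the Locus group and Locus type attributes listed above can be expressed as a lookup table. The following Python sketch copies the groups exactly as they are listed on this page; the variable and function names are hypothetical, and a locus type that is not listed under any group (such as "complex locus constituent") simply returns None.

```python
# Locus groups and the locus types they contain, as listed on this page.
LOCUS_GROUPS = {
    "protein-coding gene": ["gene with protein product"],
    "non-coding RNA": [
        "RNA, cluster", "RNA, long non-coding", "RNA, micro",
        "RNA, ribosomal", "RNA, small cytoplasmic", "RNA, small misc",
        "RNA, small nuclear", "RNA, small nucleolar", "RNA, transfer",
    ],
    "pseudogene": [
        "immunoglobulin pseudogene", "pseudogene",
        "T cell receptor pseudogene",
    ],
    "phenotype": ["phenotype only"],
    "other": [
        "endogenous retrovirus", "fragile site", "immunoglobulin gene",
        "readthrough", "region", "T cell receptor gene",
        "transposable element", "unknown", "virus integration site",
    ],
    "withdrawn": ["withdrawn"],
}

# Invert the table so that a locus type can be looked up directly.
TYPE_TO_GROUP = {
    locus_type: group
    for group, locus_types in LOCUS_GROUPS.items()
    for locus_type in locus_types
}


def locus_group_for(locus_type):
    """Return the locus group for a locus type, or None if not listed."""
    return TYPE_TO_GROUP.get(locus_type)


print(locus_group_for("RNA, micro"))                 # non-coding RNA
print(locus_group_for("complex locus constituent"))  # None
```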
Family/group attributes
- HGNC family/group attributes
-
- External family/group resources
-
- PubMed ID
- PubMed ID for a reference pertinent to the gene family/group. We do not aim to list all possible published papers on the family/group, but we provide PubMed IDs for papers that first described the gene family/group in question or papers that are particularly relevant to the nomenclature of the genes.
- HGNC Gene attributes
-
- Approved symbol
- The official gene symbol that has been approved by the HGNC and is publicly available. Symbols are approved based on specific HGNC nomenclature guidelines. In the HTML results page this ID links to the HGNC Symbol Report for that gene.
- Approved name
- The official gene name that has been approved by the HGNC and is publicly available. Names are approved based on specific HGNC nomenclature guidelines.
- Locus group
- Groups locus types together into related sets. Below is a list of groups and the locus types within the group:
- protein-coding gene - contains the "gene with protein product" locus type
- non-coding RNA - contains the following locus types:
- RNA, cluster
- RNA, long non-coding
- RNA, micro
- RNA, ribosomal
- RNA, small cytoplasmic
- RNA, small misc
- RNA, small nuclear
- RNA, small nucleolar
- RNA, transfer
- pseudogene - contains the following types:
- immunoglobulin pseudogene
- pseudogene
- T cell receptor pseudogene
- phenotype - contains the "phenotype only" locus type
- other - contains the following types:
- endogenous retrovirus
- fragile site
- immunoglobulin gene
- readthrough
- region
- T cell receptor gene
- transposable element
- unknown
- virus integration site
- withdrawn - contains the "withdrawn" locus type only
- Locus type
- Specifies the type of locus described by the given entry:
- complex locus constituent - transcriptional unit that is part of a named complex locus
- endogenous retrovirus - integrated retroviral elements that are transmitted through the germline (SO:0000100)
- fragile site - a heritable locus on a chromosome that is prone to DNA breakage
- gene with protein product - protein-coding genes (the protein may be predicted and of unknown function) (SO:0001217)
- immunoglobulin gene - gene segments that undergo somatic recombination to form heavy or light chain immunoglobulin genes (SO:0000460)
- immunoglobulin pseudogene - immunoglobulin gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- phenotype only - mapped phenotypes (SO:0001500)
- pseudogene - genomic DNA sequences that are similar to protein-coding genes but do not encode a functional protein (SO:0000336)
- readthrough - a naturally occurring transcript containing coding sequence from two or more genes that can also be transcribed individually
- region - extents of genomic sequence that contain one or more genes, also applied to non-gene areas that do not fall into other types
- RNA, cluster - region containing a cluster of small non-coding RNA genes
- RNA, long non-coding - non-protein coding genes that encode long non-coding RNAs (lncRNAs); these are at least 200 nt and are represented by a processed transcript and/or at least 3 ESTs
- RNA, micro - non-protein coding genes that encode microRNAs (miRNAs) (SO:0001265)
- RNA, ribosomal - non-protein coding genes that encode ribosomal RNAs (rRNAs) (SO:0001637)
- RNA, small nuclear - non-protein coding genes that encode small nuclear RNAs (snRNAs) (SO:0001268)
- RNA, small nucleolar - non-protein coding genes that encode small nucleolar RNAs (snoRNAs) containing C/D or H/ACA box domains (SO:0001267)
- RNA, small cytoplasmic - non-protein coding genes that encode small cytoplasmic RNAs (scRNAs) (SO:0001266)
- RNA, transfer - non-protein coding genes that encode transfer RNAs (tRNAs) (SO:0001272)
- RNA, small misc - non-protein coding genes that encode miscellaneous types of small ncRNAs
- T cell receptor gene - gene segments that undergo somatic recombination to form either alpha, beta, gamma or delta chain T cell receptor genes (SO:0000460)
- T cell receptor pseudogene - T cell receptor gene segments that are inactivated due to frameshift mutations and/or stop codons in the open reading frame
- transposable element - a segment of repetitive DNA that can move, or retrotranspose, to new sites within the genome (SO:0000101)
- unknown - entries where the locus type is currently unknown
- virus integration site - target sequence for the integration of viral DNA into the genome
- Other external resources
-
- NCBI gene ID
- The NCBI gene ID associated with the HGNC gene symbol. NCBI Gene provides curated sequence and descriptive information about genetic loci including official nomenclature, synonyms, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.
- UCSC gene ID
- The UCSC gene ID associated with the HGNC gene symbol. The ID identifies an annotated human gene record within the UCSC genome browser.
- UniProt accession
- The UniProt identifier for a protein product of the gene. The UniProt Protein Knowledgebase is described as a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases.
- Vega gene ID
- The Vega gene ID associated with the HGNC gene symbol. The VEGA database is a repository for high-quality gene models produced by the manual annotation of vertebrate genomes.