Gene id conversion
This page describes several ways to convert gene IDs from one format to another, e.g. if you have RefSeq identifiers but need gene symbols (e.g. "BRCA2"), or have NCBI Entrez Gene identifiers and need Ensembl identifiers.
With the UCSC Browser
(from the mailing list)
There are three options for extracting the data.
- merge/download data from the Table browser
- query the public MySQL database
- download text files via FTP
Merge/download data from the Table browser http://genome.ucsc.edu/cgi-bin/hgTables
- set the clade, genome, assembly
- set the group to "Genes and Gene Prediction Tracks"
- for the track, use UCSC Genes; the default table is knownGene. Click on "View table schema" to see field contents/order.
- set region: genomic for the entire dataset, or filter by region or identifiers.
- You can upload a list of Entrez Gene names (or other identifiers) at this point to limit the output, but this is not necessary; you can filter the file later. For identifiers, either upload your list of gene IDs or paste them in.
- This primary table (knownGene) does not contain alternate gene names. To link those in:
- set output format: selected fields from primary table and related tables
- enter an output file name so that the result is downloaded as a file
- add in the linked table kgXref and check columns to download, then submit
- Starting again at step c, do the same for the Ensembl Genes track. Follow the same steps until step g, where you will first need to link in the table knownToEnsembl, then the table kgXref.
Table Browser Help/FAQ: http://genome.ucsc.edu/cgi-bin/hgTables#Help http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
You can convert gene IDs into gene names using the Table Browser (Tables). Here are the steps:
- Select clade, genome, assembly of interest
- Select group: Genes and Gene Prediction Tracks
- Select track: UCSC Genes
- Select output format: selected fields from primary and related tables
- Click Get Output
On the resulting screen:
- Check kgID and geneSymbol
- Click "get output"
You should get output that looks like:
kgID        geneSymbol
uc001aal.1  OR4F5
uc001aaq.1  DQ599874
uc001aar.1  DQ599768
uc001aax.1  BC036251
uc001aba.1  X64709
uc001abv.1  SAMD11
uc001abw.1  SAMD11
uc001abx.1  SAMD11
uc001aca.1  KLHL17
uc001acb.1  KLHL17
uc001acc.1  KLHL17
uc009vjh.1  OR4F5
uc009vjn.1  LOC643837
uc009vjo.1  LOC643837
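Once the two-column output is saved, it can be turned into lookup tables with a few lines of Python. The following is a minimal sketch, assuming the file is tab-separated and that any header line starts with "#"; both are assumptions about your download, and the function name `parse_id_table` is made up for this example. The rows are a subset of the example output above.

```python
from collections import defaultdict

# A few rows copied from the example output above (kgID, geneSymbol).
# Assumption: columns are tab-separated and the header line starts with "#".
TABLE_BROWSER_OUTPUT = (
    "#kgID\tgeneSymbol\n"
    "uc001aal.1\tOR4F5\n"
    "uc001abv.1\tSAMD11\n"
    "uc001abw.1\tSAMD11\n"
    "uc001abx.1\tSAMD11\n"
    "uc001aca.1\tKLHL17\n"
    "uc009vjh.1\tOR4F5\n"
)


def parse_id_table(text):
    """Return (kgID -> symbol, symbol -> list of kgIDs) from two-column text."""
    id_to_symbol = {}
    symbol_to_ids = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        kg_id, symbol = line.split("\t")[:2]
        id_to_symbol[kg_id] = symbol
        symbol_to_ids[symbol].append(kg_id)
    return id_to_symbol, dict(symbol_to_ids)


id_to_symbol, symbol_to_ids = parse_id_table(TABLE_BROWSER_OUTPUT)
print(id_to_symbol["uc001aca.1"])   # KLHL17
print(symbol_to_ids["SAMD11"])      # ['uc001abv.1', 'uc001abw.1', 'uc001abx.1']
```

Note that several transcripts (kgIDs) can share one gene symbol, which is why the reverse mapping stores a list.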
ChatGPT suggestions in 2024
1. BioMart (Ensembl)
- **Website**: [BioMart Ensembl](https://www.ensembl.org/biomart/martview)
- This tool allows you to query Ensembl databases and convert various gene identifiers (Ensembl, Gencode, etc.) to human-readable gene symbols. You can also convert other types of IDs (e.g., HGNC symbols, RefSeq IDs).
- **How to use it**:
- Select a dataset (e.g., "Human genes" under "Ensembl Genes").
- Under "Filters," choose the input gene IDs.
- Under "Attributes," select "External Gene Names" to output the gene symbols.
2. DAVID (Database for Annotation, Visualization, and Integrated Discovery)
- **Website**: [DAVID](https://david.ncifcrf.gov/)
- DAVID is a powerful resource for converting gene IDs and also for performing functional annotation. You can upload lists of gene IDs, and it will return gene symbols along with various annotations.
- **How to use it**:
- Input your list of gene IDs.
- Choose the identifier you are converting from.
- Select "Gene Name" or "Gene Symbol" for conversion.
3. UniProt
- **Website**: [UniProt ID Mapping](https://www.uniprot.org/uploadlists/)
- UniProt provides a flexible tool to map between gene/protein IDs like Ensembl IDs, RefSeq, UniProt IDs, and gene symbols.
- **How to use it**:
- Upload a list of IDs.
- Select the database you are converting from and to (e.g., Ensembl Gene to Gene Name).
4. HGNC (HUGO Gene Nomenclature Committee)
- **Website**: [HGNC Custom Downloads](https://www.genenames.org/download/custom/)
- The HGNC provides authoritative gene symbol assignments. You can use their custom download tool to map between gene symbols, Ensembl IDs, and other identifiers.
- **How to use it**:
- Choose the input type, e.g., Ensembl Gene ID, and specify the output as gene symbols.
5. g:Profiler
- **Website**: [g:Profiler](https://biit.cs.ut.ee/gprofiler/)
- This tool allows for ID mapping and also provides functional profiling of gene lists.
- **How to use it**:
- Input your list of gene IDs.
- Select the conversion options and target gene names.
These tools should cover most gene ID to gene symbol conversion needs for the human genome.
With the UCSC Browser public MySQL database
- Using the Table Browser to help you understand the database and table names/format, write your own SQL query to extract data.
- Public MySQL FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads#download29
- e.g.
mysql --no-defaults -h genome-mysql.soe.ucsc.edu -u genome -A hg19 -NB -e 'select geneSymbol, txEnd-txStart AS Size from knownGene, kgXref where kgXref.kgId=knownGene.name'
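The same command line can be assembled from a script when you have a list of identifiers to look up. The sketch below builds a query that maps RefSeq accessions to gene symbols and wraps it in the `mysql` call shown above. It assumes that kgXref has a `refseq` column holding accessions without a version suffix (check "View table schema" for your assembly); the function names are made up for this example, and `run_query` needs the `mysql` client and network access.

```python
import re
import subprocess

# Host and user are taken from the command line shown above.
HOST = "genome-mysql.soe.ucsc.edu"
USER = "genome"


def refseq_to_symbol_sql(refseq_ids):
    """Build a query mapping RefSeq accessions to gene symbols via kgXref.

    Assumption: kgXref has 'refseq' and 'geneSymbol' columns.
    """
    for acc in refseq_ids:
        # Only allow plain accession characters so the IDs are safe to quote.
        if not re.fullmatch(r"[A-Za-z0-9_.]+", acc):
            raise ValueError(f"unexpected characters in ID: {acc!r}")
    id_list = ", ".join(f"'{acc}'" for acc in refseq_ids)
    return f"select refseq, geneSymbol from kgXref where refseq in ({id_list})"


def mysql_command(database, sql):
    """Return the argument list for the mysql client (same flags as above)."""
    return ["mysql", "--no-defaults", "-h", HOST, "-u", USER,
            "-A", database, "-NB", "-e", sql]


def run_query(database, sql):
    """Run the query and return rows as lists of strings (needs network)."""
    result = subprocess.run(mysql_command(database, sql),
                            capture_output=True, text=True, check=True)
    return [line.split("\t") for line in result.stdout.splitlines()]


sql = refseq_to_symbol_sql(["NM_000059"])
print(" ".join(mysql_command("hg19", sql)))
```

Calling `run_query("hg19", sql)` would then execute the lookup; it is left out of the example so that the block runs without a database connection.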
With the UCSC Browser FTP text files
- Use FTP to get the complete tables in text file format and perform data merges to link the aligned transcripts in the primary tables to the gene names (such as Entrez Gene names).
- You will need to use your own shell, Perl, or other tools to do the merges. Again, first use the Table Browser navigation tools to help you understand the database and table names/format.
- Download ftp FAQ: http://genome.ucsc.edu/FAQ/FAQdownloads#download1
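As an illustration of such a merge, here is a minimal Python sketch that joins the knownGene and kgXref table dumps on the transcript ID. It assumes the downloaded files are gzipped, tab-separated, and without a header line, that kgXref has the kgID in column 1 and geneSymbol in column 5, and that knownGene starts with name, chrom, strand, txStart, txEnd. Check the accompanying .sql files or "View table schema" before relying on this column order. The function names and file names are made up for this example.

```python
import csv
import gzip


def load_symbols(kgxref_lines):
    """Map kgID (column 1) to geneSymbol (column 5) from kgXref rows."""
    symbols = {}
    for row in csv.reader(kgxref_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue  # skip blank or truncated lines
        symbols[row[0]] = row[4]
    return symbols


def annotate_transcripts(knowngene_lines, symbols):
    """Yield (name, chrom, strand, txStart, txEnd, symbol) per knownGene row."""
    for row in csv.reader(knowngene_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue
        name, chrom, strand, tx_start, tx_end = row[:5]
        # Transcripts without a kgXref entry get an empty symbol.
        yield name, chrom, strand, int(tx_start), int(tx_end), symbols.get(name, "")


def merge_files(knowngene_path, kgxref_path):
    """Read both gzipped dumps and return the annotated transcript list."""
    with gzip.open(kgxref_path, "rt") as xref:
        symbols = load_symbols(xref)
    with gzip.open(knowngene_path, "rt") as genes:
        return list(annotate_transcripts(genes, symbols))
```

A call such as `merge_files("knownGene.txt.gz", "kgXref.txt.gz")` would then return one tuple per transcript, with the gene symbol appended.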
All annotation tracks are mapped using the same coordinate system to the genomic assembly and so are directly comparable. Be aware that we use a zero-based start coordinate and a 1-based stop coordinate. We also record all alignments with respect to the positive strand, so if an alignment is on the negative strand, the start and stop will be reversed if compared to the file/table headers. These links describe in detail our file format conventions: http://genome.ucsc.edu/FAQ/FAQformat http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 http://genome.ucsc.edu/FAQ/FAQtracks#tracks17
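The zero-based start / one-based stop convention is easy to get wrong when converting to the one-based positions shown in the browser. A short sketch of the arithmetic, with helper names made up for this example:

```python
def feature_length(start0, end1):
    """Length of a feature with zero-based start and one-based end."""
    # No "+1" is needed: this is why txEnd - txStart gives the size directly.
    return end1 - start0


def to_one_based(chrom, start0, end1):
    """Format table coordinates as a one-based, inclusive position string."""
    return f"{chrom}:{start0 + 1}-{end1}"


def from_one_based(start1, end1):
    """Convert a one-based, inclusive range back to table coordinates."""
    return start1 - 1, end1


print(feature_length(0, 100))        # 100
print(to_one_based("chr1", 0, 100))  # chr1:1-100
print(from_one_based(1, 100))        # (0, 100)
```

In other words, the first 100 bases of chr1 are stored as start 0 and end 100 in the tables, but displayed as chr1:1-100.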
A final option is to send the data to Galaxy (from the Table Browser or uploaded text files). The functions for Interval format data are very useful and could aid in grouping the various mapped transcripts together into genes/clusters. It may be worth comparing which transcripts the Interval functions will group versus which the Entrez gene name will group. Link to Galaxy FAQ: http://g2.trac.bx.psu.edu/wiki/GopsDesc
With BioMart (for Ensembl IDs)
BioMart (http://www.biomart.org) is probably the best solution if your source IDs are from Ensembl:
Click-Path:
- martview (top-right of screen)
- ensembl56 genes
- (select your species)
- "filters"
- "gene"
- paste your ids into "id list limit"
- "attributes"
- "GENE"
- uncheck "ensembl transcript id"
- uncheck "ensembl gene id" if you want to get rid of it
- "EXTERNAL"
- check "HGNC symbol" (or "HGNC automatic gene name" if not human)
- "results"
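The click-path above can also be expressed as a BioMart XML query and sent to a martservice URL. The following is a sketch under several assumptions that are not part of the original instructions: the Ensembl martservice endpoint, the dataset name `hsapiens_gene_ensembl`, and the filter/attribute names `ensembl_gene_id` and `hgnc_symbol`. Verify them in the BioMart web interface (it can display the XML for a query you built by clicking) before relying on them. The function names are made up for this example.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumption: Ensembl's BioMart service endpoint.
MARTSERVICE = "https://www.ensembl.org/biomart/martservice"


def build_query(ensembl_ids, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query: Ensembl gene IDs in, HGNC symbols out.

    Assumption: the dataset, filter, and attribute names below exist
    in the mart you are querying.
    """
    ids = ",".join(ensembl_ids)  # the ID list filter takes comma-separated values
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f'<Filter name="ensembl_gene_id" value={quoteattr(ids)}/>'
        '<Attribute name="ensembl_gene_id"/>'
        '<Attribute name="hgnc_symbol"/>'
        "</Dataset></Query>"
    )


def query_url(ensembl_ids):
    """Return a URL that submits the query as the 'query' parameter."""
    return MARTSERVICE + "?" + urlencode({"query": build_query(ensembl_ids)})


print(query_url(["ENSG00000139618"]))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return tab-separated lines of Ensembl gene ID and HGNC symbol if the assumed names are correct. For long ID lists, sending the query in a POST request instead of the URL is likely the safer choice.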
With external tools
List of some external tools and comparison
- DAVID and MatchMiner performed best in a comparison using 100 random identifiers
- Roth lab, http://llama.mshri.on.ca/synergizer/translate/