Finding nearby genes

From genomewiki
Revision as of 19:21, 26 May 2017 by Cath Tyner (talk | contribs)
Jump to navigationJump to search

Introduction

If you are interested in a certain genomic position, or reference point, and you want to find a sample of nearby genes upstream and downstream from this position, you can create a script by copying one of the examples below.

  • Open your editor on the command line and create a script in your bin directory.
  • E.g.,
vi closestGene.sh
  • Paste one of the scripts below into your closestGene.sh file
  • Run the script.
closestGene.sh

Alternatives

  • Galaxy has a "Fetch closest non-overlapping feature" tool
  • the BedTools include a tool "closestBed"

Script for knownGene on hg18

View table schema for knownGene, hg18

#!/bin/sh

# given position chr1:710000-720000
# find a sample of genes near this upstream and downstream
C=chr1
S=710000
E=720000

echo "three upstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
ORDER BY e.txEnd DESC limit 3;' hg18

echo "three downstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 3;' hg18

This produces the output:

three upstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
| chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
| chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
+------+--------+--------+------------+----------+
three downstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
| chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
| chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
+------+--------+--------+------------+----------+

Script for ncbiRefSeq on hg38

Here is a script for the gene set ncbiRefSeq on hg38.

View table schema for ncbiRefSeq, hg38

#!/bin/sh

# for gene set ncbiRefSeq
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg38
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.


echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G


echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${E} - txEnd"
echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G

This produces the output:

closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+----------------+---------+--------------+
| chrom | txStart | txEnd  | strand | name           | name    | 991973-txEnd |
+-------+---------+--------+--------+----------------+---------+--------------+
| chr1  |  975198 | 982117 | -      | NM_001291367.1 | PERM1   |         9856 |
| chr1  |  975198 | 982117 | -      | NM_001291366.1 | PERM1   |         9856 |
| chr1  |  975198 | 982093 | -      | XM_017002583.1 | PERM1   |         9880 |
| chr1  |  975198 | 982021 | -      | XM_017002584.1 | PERM1   |         9952 |
| chr1  |  975197 | 981657 | -      | XM_017002585.1 | PERM1   |        10316 |
| chr1  |  966496 | 975108 | +      | NM_032129.2    | PLEKHN1 |        16865 |
| chr1  |  966496 | 975108 | +      | NM_001160184.1 | PLEKHN1 |        16865 |
| chr1  |  965819 | 974587 | +      | XM_006710944.3 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002476.1 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002474.1 | PLEKHN1 |        17386 |
+-------+---------+--------+--------+----------------+---------+--------------+
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txStart is the 3' end, not the transcription start site
+-------+---------+---------+--------+----------------+--------------+----------------+
| chrom | txStart | txEnd   | strand | name           | name         | 991973-txStart |
+-------+---------+---------+--------+----------------+--------------+----------------+
| chr1  |  998961 | 1000172 | -      | NM_021170.3    | HES4         |          -6988 |
| chr1  |  998961 | 1001052 | -      | XM_005244771.4 | HES4         |          -6988 |
| chr1  |  998963 | 1000172 | -      | NM_001142467.1 | HES4         |          -6990 |
| chr1  | 1013466 | 1014540 | +      | NM_005101.3    | ISG15        |         -21493 |
| chr1  | 1020101 | 1056119 | +      | XM_011541429.2 | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XR_946650.2    | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XM_005244749.3 | AGRN         |         -28128 |
| chr1  | 1020122 | 1056119 | +      | NM_198576.3    | AGRN         |         -28149 |
| chr1  | 1020122 | 1056119 | +      | NM_001305275.1 | AGRN         |         -28149 |
| chr1  | 1059706 | 1066441 | +      | XR_001737601.1 | LOC100288175 |         -67733 |
+-------+---------+---------+--------+----------------+--------------+----------------+

Script for refGene on hg19

Here is a script for the gene set refGene on hg19.

View the table schema for refGene, hg19

#!/bin/sh

# for gene set refGene
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg19
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# This script uses the gene set refGene.
# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.

echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit 10;' $G

echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 10;' $G
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+--------------+------------+--------------+
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txEnd |
+-------+---------+--------+--------+--------------+------------+--------------+
| chr1  |  955502 | 991499 | +      | NM_198576    | AGRN       |          474 |
| chr1  |  948846 | 949919 | +      | NM_005101    | ISG15      |        42054 |
| chr1  |  934341 | 935552 | -      | NM_021170    | HES4       |        56421 |
| chr1  |  934343 | 935552 | -      | NM_001142467 | HES4       |        56421 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_001160184 | PLEKHN1    |        81489 |
| chr1  |  895966 | 901099 | +      | NM_198317    | KLHL17     |        90874 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
+-------+---------+--------+--------+--------------+------------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not transcription start site
+-------+---------+---------+--------+--------------+------------+----------------+
| chrom | txStart | txEnd   | strand | name         | geneSymbol | 991973-txStart |
+-------+---------+---------+--------+--------------+------------+----------------+
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1072396 | 1079434 | +      | NR_038869    | LOC254099  |         -80423 |
| chr1  | 1102483 | 1102578 | +      | NR_029639    | MIR200B    |        -110510 |
| chr1  | 1103242 | 1103332 | +      | NR_029834    | MIR200A    |        -111269 |
| chr1  | 1104384 | 1104467 | +      | NR_029957    | MIR429     |        -112411 |
| chr1  | 1109285 | 1133313 | +      | NM_001130045 | TTLL10     |        -117312 |
+-------+---------+---------+--------+--------------+------------+----------------+