Finding nearby genes: Difference between revisions

From genomewiki
mNo edit summary
(Template mysql command for finding nearest transcripts)
 
(30 intermediate revisions by one other user not shown)
Line 1: Line 1:
Let's say you had a position, and you wanted to find a sample
===Introduction===
of nearby genes upstream and downstream from this position.
If you are interested in a certain genomic position, or reference point, and want to find a sample
of nearby genes upstream and downstream of this position, you can create a script by copying
one of the examples below. These scripts find the nearest transcripts (upstream and downstream)
of your reference point and also report the gene name. The last two scripts (for hg19 & hg38)
also report the distance between each nearby transcript and the reference point.


* Open your editor on the command line and create a script in your bin directory.
* E.g.,
<PRE>
vi closestGene.sh
</PRE>
*Paste one of the scripts below into your closestGene.sh file
*Of course, make sure your script has the proper permissions to be executable:
<pre>
chmod +x closestGene.sh
</pre>
* Run the script.
<PRE>
closestGene.sh
</PRE>
* Scripts assume that you have MySQL access (installed MySQL and a .hg.conf file). [https://genome.ucsc.edu/goldenpath/help/mysql.html See more about the UCSC Genome Browser MySQL database].


This can be done with a MySQL query to the public MySQL server
===Alternatives===
====Galaxy====
* [http://main.g2.bx.psu.edu/ Galaxy] has a "Fetch closest non-overlapping feature" tool.
* Use the [http://genome.ucsc.edu/cgi-bin/hgTables Table Browser], select a track, and click on the link, "Galaxy" (next to "output format") to send output to Galaxy.
* Use the [https://biostar.usegalaxy.org/p/12021/#12083 "Fetch closest non-overlapping feature for every interval"] under the "Operate on Genomic Intervals" menu in Galaxy
* For questions using Galaxy, contact Galaxy help: https://biostar.usegalaxy.org/
* Galaxy posts that may help: [https://biostar.usegalaxy.org/p/4689/ post1] [https://biostar.usegalaxy.org/p/12021/#12083 post 2]
====BedTools====
* The BedTools include a tool [http://bedtools.readthedocs.io/en/latest/content/example-usage.html#bedtools-closest "closestBed"]
====Multi-Region====
* Use the [https://genome.ucsc.edu/goldenpath/help/multiRegionHelp.html "Multi-Region tool"] to remove, or "slice out" intergenic regions in the browser, allowing you to visualize a region with a "gene-only" (or exon-only) view. Currently, the multi-region option does not provide a way to download the gene-only or exon-only regions you are viewing in the browser.
 
==Template MySQL Query==
All of the example scripts below are specialized or slightly modified versions of the following template MySQL command, in which every variable within ${} is a customizable parameter:
<pre>
mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \
  table1.chrom, table1.${chromStart}, table1.${chromEnd}, table1.strand, table1.name, table2.name as geneSymbol from ${tblName1} table1,\
    ${tblName2} table2 where table1.name = table2.id AND table1.chrom='${chrom}' AND \
        ((table1.${chromStart} >= ${refStart} - ${range} AND table1.${chromStart} <= ${refEnd} + ${range}) OR \
        (table1.${chromEnd} >= ${refStart} - ${range} AND table1.${chromEnd} <= ${refEnd} + ${range})) \
  order by table1.${chromEnd} desc " $db
</pre>
 
The parameters are explained below, where the value after the '=' sign is an example value:
<pre>
chromStart="txStart"          # field name of the transcript start for the primary table
chromEnd="txEnd"              # field name of the transcript end for the primary table
tblName1="ncbiRefSeqCurated"  # primary table name that stores the transcript coordinates
tblName2="ncbiRefSeqLink"    # optional secondary table with geneSymbol information
chrom="chr1"                  # chromosome of interest
range="10000"                # optional range outside of interest point
refStart="166167154"          # start coordinate of range of interest
refEnd="166167602"            # end coordinate of range of interest
db="hg38"                    # database of interest
</pre>
The above query, with the above example values, finds all transcripts in the ncbiRefSeqCurated table within 10kb of chr1:166167154-166167602, which is an example enhancer region:
<pre>
+-------+-----------+-----------+--------+----------------+------------+
| chrom | txStart  | txEnd    | strand | name          | geneSymbol |
+-------+-----------+-----------+--------+----------------+------------+
| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B    |
| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B    |
| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B    |
+-------+-----------+-----------+--------+----------------+------------+
</pre>
Both the example query and the example parameters can be pasted directly into a bash shell and modified to suit your needs. The example scripts further down this page are all variations on this template, specialized for specific applications (downstream only, upstream only, etc.).
==Examples==
==="Nearest gene" script for knownGene on hg18===
* This script will find the closest transcripts to a reference point region.
* [http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema View table schema for knownGene, hg18]


Alternatives:
* [http://main.g2.bx.psu.edu/ Galaxy] has a "Fetch closest non-overlapping feature" tool
* the BedTools include a tool "closestBed"
===Script for knownGene on hg18===
<PRE>
<PRE>
#!/bin/sh
#!/bin/sh
Line 51: Line 115:
+------+--------+--------+------------+----------+
+------+--------+--------+------------+----------+
</PRE>
</PRE>
===Script for ncbiRefSeq on hg38===
 
Here is a script for the gene set ncbiRefSeq on hg38:
==="Nearest gene" script for refGene on hg19===
* This script will find the closest transcripts to a reference point region for the gene set refGene on hg19.
 
* For this example, [http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=cath&hgS_otherUserSessionName=MLQ19469 the output can be seen in this session], where the custom track labeled "closest" shows the regions in the MySQL output (the 10 closest transcripts upstream and the 10 closest transcripts downstream). The other custom track, labeled "distanceCheck", is derived from the last column in the SQL output: the number of bp between each transcript and the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached. That point may be the 5' end or the 3' end, depending on strand orientation.
 
* [http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema View the table schema for refGene, hg19]


<PRE>
<PRE>
#!/bin/sh
#!/bin/sh


# for gene set ncbiRefSeq
# for gene set refGene
# given position chr1:991973-991973
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream
# find a sample of genes near this upstream and downstream


# Input your assembly
# Input your assembly
G=hg38
G=hg19
# Input the chr for reference point
# Input the chr for reference point
C=chr1
C=chr1
Line 72: Line 141:
N=10
N=10


# This script uses the gene set refGene.
# Any gene set can be used. If a different gene set is used, check that
# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# the field names are the same, they may need updating. To check this,
Line 79: Line 149:
# The last column is the distance from the comparison point.
# The last column is the distance from the comparison point.


 
echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
AS "'${S}'-txEnd" FROM
   ncbiRefSeq e,
   refGene e,
   ncbiRefSeqLink j
   kgXref j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G
ORDER BY e.txEnd DESC limit 10;' $G


 
echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "last column is distance from reference point to transcript, ${E} - txEnd"
echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \
echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \
start site"
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
AS "'${E}'-txStart" FROM
   ncbiRefSeq e,
   refGene e,
   ncbiRefSeqLink j
   kgXref j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}'
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G
ORDER BY e.txStart ASC limit 10;' $G
</PRE>
</PRE>


Line 109: Line 177:


<PRE>
<PRE>
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txEnd
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+----------------+---------+--------------+
+-------+---------+--------+--------+--------------+------------+--------------+
| chrom | txStart | txEnd  | strand | name           | name    | 991973-txEnd |
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txEnd |
+-------+---------+--------+--------+----------------+---------+--------------+
+-------+---------+--------+--------+--------------+------------+--------------+
| chr1  |  975198 | 982117 | -     | NM_001291367.1 | PERM1  |         9856 |
| chr1  |  955502 | 991499 | +     | NM_198576    | AGRN      |         474 |
| chr1  |  975198 | 982117 | -     | NM_001291366.1 | PERM1  |         9856 |
| chr1  |  948846 | 949919 | +     | NM_005101    | ISG15      |       42054 |
| chr1  |  975198 | 982093 | -      | XM_017002583.1 | PERM1  |         9880 |
| chr1  |  934341 | 935552 | -      | NM_021170    | HES4      |       56421 |
| chr1  |  975198 | 982021 | -      | XM_017002584.1 | PERM1  |         9952 |
| chr1  |  934343 | 935552 | -      | NM_001142467 | HES4      |       56421 |
| chr1  |  975197 | 981657 | -     | XM_017002585.1 | PERM1  |        10316 |
| chr1  |  901876 | 910484 | +     | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  966496 | 975108 | +      | NM_032129.2   | PLEKHN1 |        16865 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1   |        81489 |
| chr1  |  966496 | 975108 | +      | NM_001160184.1 | PLEKHN1 |        16865 |
| chr1  |  901876 | 910484 | +      | NM_001160184 | PLEKHN1   |        81489 |
| chr1  |  965819 | 974587 | +      | XM_006710944.3 | PLEKHN1 |        17386 |
| chr1  |  895966 | 901099 | +      | NM_198317    | KLHL17    |        90874 |
| chr1  |  965819 | 974587 | +     | XM_017002476.1 | PLEKHN1 |        17386 |
| chr1  |  879582 | 894679 | -     | NM_015658    | NOC2L      |        97294 |
| chr1  |  965819 | 974587 | +     | XM_017002474.1 | PLEKHN1 |        17386 |
| chr1  |  879582 | 894679 | -     | NM_015658    | NOC2L      |        97294 |
+-------+---------+--------+--------+----------------+---------+--------------+
+-------+---------+--------+--------+--------------+------------+--------------+
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txEnd
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not the transcription start site
Note: for reverse - strand items, txStart is the 3' end, not transcription start site
+-------+---------+---------+--------+----------------+--------------+----------------+
+-------+---------+---------+--------+--------------+------------+----------------+
| chrom | txStart | txEnd  | strand | name           | name        | 991973-txStart |
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txStart |
+-------+---------+---------+--------+----------------+--------------+----------------+
+-------+---------+---------+--------+--------------+------------+----------------+
| chr1  | 998961 | 1000172 | -      | NM_021170.3    | HES4         |          -6988 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223    |        -15152 |
| chr1  | 998961 | 1001052 | -      | XM_005244771.4 | HES4         |          -6988 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223    |        -15152 |
| chr1  | 998963 | 1000172 | -      | NM_001142467.1 | HES4         |          -6990 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159  |        -25224 |
| chr1  | 1013466 | 1014540 | +     | NM_005101.3   | ISG15        |        -21493 |
| chr1  | 1017197 | 1051736 | -     | NM_017891   | C1orf159  |        -25224 |
| chr1  | 1020101 | 1056119 | +     | XM_011541429.2 | AGRN        |        -28128 |
| chr1  | 1017197 | 1051736 | -     | NM_017891    | C1orf159  |        -25224 |
| chr1  | 1020101 | 1056119 | +      | XR_946650.2   | AGRN        |        -28128 |
| chr1  | 1072396 | 1079434 | +      | NR_038869   | LOC254099  |        -80423 |
| chr1  | 1020101 | 1056119 | +      | XM_005244749.3 | AGRN        |         -28128 |
| chr1  | 1102483 | 1102578 | +      | NR_029639    | MIR200B    |       -110510 |
| chr1  | 1020122 | 1056119 | +      | NM_198576.3   | AGRN        |         -28149 |
| chr1  | 1103242 | 1103332 | +      | NR_029834   | MIR200A    |       -111269 |
| chr1  | 1020122 | 1056119 | +      | NM_001305275.1 | AGRN        |         -28149 |
| chr1  | 1104384 | 1104467 | +      | NR_029957    | MIR429    |       -112411 |
| chr1  | 1059706 | 1066441 | +      | XR_001737601.1 | LOC100288175 |         -67733 |
| chr1  | 1109285 | 1133313 | +      | NM_001130045 | TTLL10    |       -117312 |
+-------+---------+---------+--------+----------------+--------------+----------------+
+-------+---------+---------+--------+--------------+------------+----------------+
</PRE>
</PRE>


===Script for ncbiRefSeq on hg38===
==="Nearest gene" script for ncbiRefSeq on hg38===
Here is a script for the gene set refGene on hg19:
* This script will find the closest transcripts to a reference point region for the gene set ncbiRefSeq on hg38.
 
* Note that the last column in the SQL output is the distance, or the number of bp that each transcript is, from the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached - that point may be the 5' end or the 3' end depending on strand orientation.
 
* [http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=refSeqComposite&hgta_table=ncbiRefSeq&hgta_doSchema=describe+table+schema View table schema for ncbiRefSeq, hg38]
 
<PRE>
<PRE>
#!/bin/sh
#!/bin/sh
Line 198: Line 271:
ORDER BY e.txStart ASC limit '${N}';' $G
ORDER BY e.txStart ASC limit '${N}';' $G
</PRE>
</PRE>
This produces the output:


<PRE>
<PRE>
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+--------------+------------+--------------+
+-------+---------+--------+--------+----------------+---------+--------------+
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txEnd |
| chrom | txStart | txEnd  | strand | name           | name    | 991973-txEnd |
+-------+---------+--------+--------+--------------+------------+--------------+
+-------+---------+--------+--------+----------------+---------+--------------+
| chr1  |  955502 | 991499 | +     | NM_198576    | AGRN      |         474 |
| chr1  |  975198 | 982117 | -     | NM_001291367.1 | PERM1  |         9856 |
| chr1  |  948846 | 949919 | +     | NM_005101    | ISG15      |       42054 |
| chr1  |  975198 | 982117 | -     | NM_001291366.1 | PERM1  |         9856 |
| chr1  |  934341 | 935552 | -      | NM_021170    | HES4      |       56421 |
| chr1  |  975198 | 982093 | -      | XM_017002583.1 | PERM1  |         9880 |
| chr1  |  934343 | 935552 | -      | NM_001142467 | HES4      |       56421 |
| chr1  |  975198 | 982021 | -      | XM_017002584.1 | PERM1  |         9952 |
| chr1  |  901876 | 910484 | +     | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  975197 | 981657 | -     | XM_017002585.1 | PERM1  |        10316 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1   |        81489 |
| chr1  |  966496 | 975108 | +      | NM_032129.2   | PLEKHN1 |        16865 |
| chr1  |  901876 | 910484 | +      | NM_001160184 | PLEKHN1   |        81489 |
| chr1  |  966496 | 975108 | +      | NM_001160184.1 | PLEKHN1 |        16865 |
| chr1  |  895966 | 901099 | +      | NM_198317    | KLHL17    |        90874 |
| chr1  |  965819 | 974587 | +      | XM_006710944.3 | PLEKHN1 |        17386 |
| chr1  |  879582 | 894679 | -     | NM_015658    | NOC2L      |        97294 |
| chr1  |  965819 | 974587 | +     | XM_017002476.1 | PLEKHN1 |        17386 |
| chr1  |  879582 | 894679 | -     | NM_015658    | NOC2L      |        97294 |
| chr1  |  965819 | 974587 | +     | XM_017002474.1 | PLEKHN1 |        17386 |
+-------+---------+--------+--------+--------------+------------+--------------+
+-------+---------+--------+--------+----------------+---------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set
closest downstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txStart
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txStart is the 3' end, not transcription start site
Note: for reverse - strand items, txStart is the 3' end, not the transcription start site
+-------+---------+---------+--------+--------------+------------+----------------+
+-------+---------+---------+--------+----------------+--------------+----------------+
| chrom | txStart | txEnd  | strand | name        | geneSymbol | 991973-txStart |
| chrom | txStart | txEnd  | strand | name          | name        | 991973-txStart |
+-------+---------+---------+--------+--------------+------------+----------------+
+-------+---------+---------+--------+----------------+--------------+----------------+
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223    |         -15152 |
| chr1  | 998961 | 1000172 | -      | NM_021170.3    | HES4        |         -6988 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223    |         -15152 |
| chr1  | 998961 | 1001052 | -      | XM_005244771.4 | HES4        |         -6988 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159  |         -25224 |
| chr1  | 998963 | 1000172 | -      | NM_001142467.1 | HES4        |         -6990 |
| chr1  | 1017197 | 1051736 | -     | NM_017891   | C1orf159  |        -25224 |
| chr1  | 1013466 | 1014540 | +     | NM_005101.3   | ISG15        |        -21493 |
| chr1  | 1017197 | 1051736 | -     | NM_017891    | C1orf159  |        -25224 |
| chr1  | 1020101 | 1056119 | +     | XM_011541429.2 | AGRN        |        -28128 |
| chr1  | 1072396 | 1079434 | +      | NR_038869   | LOC254099  |        -80423 |
| chr1  | 1020101 | 1056119 | +      | XR_946650.2   | AGRN        |        -28128 |
| chr1  | 1102483 | 1102578 | +      | NR_029639    | MIR200B    |       -110510 |
| chr1  | 1020101 | 1056119 | +      | XM_005244749.3 | AGRN        |         -28128 |
| chr1  | 1103242 | 1103332 | +      | NR_029834   | MIR200A    |       -111269 |
| chr1  | 1020122 | 1056119 | +      | NM_198576.3   | AGRN        |         -28149 |
| chr1  | 1104384 | 1104467 | +      | NR_029957    | MIR429    |       -112411 |
| chr1  | 1020122 | 1056119 | +      | NM_001305275.1 | AGRN        |         -28149 |
| chr1  | 1109285 | 1133313 | +      | NM_001130045 | TTLL10    |       -117312 |
| chr1  | 1059706 | 1066441 | +      | XR_001737601.1 | LOC100288175 |         -67733 |
+-------+---------+---------+--------+--------------+------------+----------------+
+-------+---------+---------+--------+----------------+--------------+----------------+
</PRE>
</PRE>


[[Category:Technical FAQ]]
[[Category:Technical FAQ]]
[[Category:User Developed Scripts]]
[[Category:User Developed Scripts]]

Latest revision as of 15:24, 17 July 2018

Introduction

If you are interested in a certain genomic position, or reference point, and want to find a sample of nearby genes upstream and downstream of this position, you can create a script by copying one of the examples below. These scripts find the nearest transcripts (upstream and downstream) of your reference point and also report the gene name. The last two scripts (for hg19 & hg38) also report the distance between each nearby transcript and the reference point.

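  • Open your editor on the command line and create a script in your bin directory, e.g.:
vi closestGene.sh
  • Paste one of the scripts below into your closestGene.sh file.
  • Make sure your script is executable:
chmod +x closestGene.sh
  • Run the script (this assumes your bin directory is on your PATH):
closestGene.sh
  • The scripts assume that you have MySQL access (MySQL installed and a .hg.conf file). See more about the UCSC Genome Browser MySQL database: https://genome.ucsc.edu/goldenpath/help/mysql.html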

Alternatives

Galaxy

  • Galaxy has a "Fetch closest non-overlapping feature" tool.
  • Use the Table Browser, select a track, and click on the link "Galaxy" (next to "output format") to send output to Galaxy.
  • Use "Fetch closest non-overlapping feature for every interval" under the "Operate on Genomic Intervals" menu in Galaxy.
  • For questions about using Galaxy, contact Galaxy help: https://biostar.usegalaxy.org/

BedTools

  • The BedTools include a tool, "closestBed".

Multi-Region

  • Use the "Multi-Region tool" to remove, or "slice out" intergenic regions in the browser, allowing you to visualize a region with a "gene-only" (or exon-only) view. Currently, the multi-region option does not provide a way to download the gene-only or exon-only regions you are viewing in the browser.
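
For readers who already have transcript coordinates in a local file, the idea behind a "closest feature" search (as offered by the Galaxy and BedTools alternatives) can be sketched in plain sh and awk. This is only an illustration, not how those tools are implemented. The function name closest_feature and the temporary file are hypothetical, and the four rows are copied from the hg19 example output further down this page.

```shell
#!/bin/sh
# Sketch: report the feature nearest to one position in a tab-separated file
# with the columns chrom, start, end, name.

bed=$(mktemp)
trap 'rm -f "$bed"' EXIT

# Four transcripts taken from the hg19 refGene example output on this page.
printf '%s\t%s\t%s\t%s\n' \
  chr1 955502 991499 AGRN \
  chr1 948846 949919 ISG15 \
  chr1 1007125 1009687 RNF223 \
  chr1 1017197 1051736 C1orf159 > "$bed"

# closest_feature CHROM POSITION FILE
# Prints the nearest feature followed by its distance in bp (0 if the
# position falls inside the feature). Distances are simple coordinate
# differences, as in the scripts on this page.
closest_feature() {
  awk -v c="$1" -v p="$2" 'BEGIN { FS = "\t"; best = -1 }
    $1 == c {
      if (p < $2)      d = $2 - p   # feature lies to the right of the position
      else if (p > $3) d = p - $3   # feature lies to the left of the position
      else             d = 0        # position is inside the feature
      if (best < 0 || d < best) { best = d; line = $0 }
    }
    END { if (best >= 0) print line "\t" best }' "$3"
}

closest_feature chr1 991973 "$bed"
```

With these rows, the nearest feature to chr1:991973 is AGRN at 474 bp, which agrees with the first row of the hg19 upstream table.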

Template MySQL Query

All of the example scripts below are specialized or slightly modified versions of the following template MySQL command, in which every variable within ${} is a customizable parameter:

mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select \
   table1.chrom, table1.${chromStart}, table1.${chromEnd}, table1.strand, table1.name, table2.name as geneSymbol from ${tblName1} table1,\
    ${tblName2} table2 where table1.name = table2.id AND table1.chrom='${chrom}' AND \
        ((table1.${chromStart} >= ${refStart} - ${range} AND table1.${chromStart} <= ${refEnd} + ${range}) OR \
        (table1.${chromEnd} >= ${refStart} - ${range} AND table1.${chromEnd} <= ${refEnd} + ${range})) \
  order by table1.${chromEnd} desc " $db

The parameters are explained below, where the value after the '=' sign is an example value:

chromStart="txStart"          # field name of the transcript start for the primary table
chromEnd="txEnd"              # field name of the transcript end for the primary table
tblName1="ncbiRefSeqCurated"  # primary table name that stores the transcript coordinates
tblName2="ncbiRefSeqLink"     # optional secondary table with geneSymbol information
chrom="chr1"                  # chromosome of interest
range="10000"                 # optional range outside of interest point
refStart="166167154"          # start coordinate of range of interest
refEnd="166167602"            # end coordinate of range of interest
db="hg38"                     # database of interest

The above query, with the above example values, finds all transcripts in the ncbiRefSeqCurated table within 10kb of chr1:166167154-166167602, which is an example enhancer region:

+-------+-----------+-----------+--------+----------------+------------+
| chrom | txStart   | txEnd     | strand | name           | geneSymbol |
+-------+-----------+-----------+--------+----------------+------------+
| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B     |
| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B     |
| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B     |
+-------+-----------+-----------+--------+----------------+------------+

Both the example query and the example parameters can be pasted directly into a bash shell and modified to suit your needs. The example scripts further down this page are all variations on this template, specialized for specific applications (downstream only, upstream only, etc.).
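
As a minimal sketch of how the parameters and the template fit together, the script below sets the example values, builds the SQL text in a helper function, and prints it so it can be inspected before anything is sent to the server. The helper name build_nearby_query, the short aliases t1 and t2, and the precomputed window variables are assumptions made for this illustration. The query logic itself follows the template above.

```shell
#!/bin/sh
# Example parameter values from this page.
chromStart="txStart"
chromEnd="txEnd"
tblName1="ncbiRefSeqCurated"
tblName2="ncbiRefSeqLink"
chrom="chr1"
range="10000"
refStart="166167154"
refEnd="166167602"
db="hg38"

# build_nearby_query: hypothetical helper that prints the SQL text only.
build_nearby_query() {
  # Compute the search window in the shell so the SQL is easier to read.
  winStart=$(( refStart - range ))
  winEnd=$(( refEnd + range ))
  printf '%s' "select t1.chrom, t1.${chromStart}, t1.${chromEnd}, t1.strand, t1.name, t2.name as geneSymbol \
from ${tblName1} t1, ${tblName2} t2 \
where t1.name = t2.id AND t1.chrom='${chrom}' AND \
((t1.${chromStart} >= ${winStart} AND t1.${chromStart} <= ${winEnd}) OR \
(t1.${chromEnd} >= ${winStart} AND t1.${chromEnd} <= ${winEnd})) \
order by t1.${chromEnd} desc"
}

query=$(build_nearby_query)
echo "$query"

# Uncomment to run against the public server (needs the mysql client and
# network access):
# mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "$query" "$db"
```

Printing the query first makes it easy to check the table names, field names, and window coordinates when adapting the template to another gene set or assembly.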

Examples

"Nearest gene" script for knownGene on hg18

#!/bin/sh

# given position chr1:710000-720000
# find a sample of genes near this upstream and downstream
C=chr1
S=710000
E=720000

echo "three upstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}'
ORDER BY e.txEnd DESC limit 3;' hg18

echo "three downstream genes from ${C}:${S}-${E}"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \
'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM
   knownGene e,
   kgXref j
WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit 3;' hg18

This produces the output:

three upstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 690107 | 703869 | uc001abo.1 | BC006361 |
| chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 |
| chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 |
+------+--------+--------+------------+----------+
three downstream genes from chr1:710000-720000
+------+--------+--------+------------+----------+
| chr1 | 752926 | 778860 | uc001abp.1 | BC102012 |
| chr1 | 752926 | 778860 | uc001abq.1 | BC042880 |
| chr1 | 752926 | 779603 | uc001abr.1 | CR601056 |
+------+--------+--------+------------+----------+

"Nearest gene" script for refGene on hg19

  • This script will find the closest transcripts to a reference point region for the gene set refGene on hg19.
  • For this example, the output can be seen in this session, where the custom track labeled "closest" shows the regions in the MySQL output (the 10 closest transcripts upstream and the 10 closest transcripts downstream). The other custom track, labeled "distanceCheck", is derived from the last column in the SQL output: the number of bp between each transcript and the reference point. This "distance" output is strand agnostic; we simply start from the reference point and count bp to the left or to the right until a transcript is reached. That point may be the 5' end or the 3' end, depending on strand orientation.
#!/bin/sh

# for gene set refGene
# given position chr1:991973-991973
# find a sample of genes near this upstream and downstream

# Input your assembly
G=hg19
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# This script uses the gene set refGene.
# Any gene set can be used. If a different gene set is used, check that
# the field names are the same, they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.

echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G

echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   refGene e,
   kgXref j
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G

This produces the output:

closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+--------------+------------+--------------+
| chrom | txStart | txEnd  | strand | name         | geneSymbol | 991973-txEnd |
+-------+---------+--------+--------+--------------+------------+--------------+
| chr1  |  955502 | 991499 | +      | NM_198576    | AGRN       |          474 |
| chr1  |  948846 | 949919 | +      | NM_005101    | ISG15      |        42054 |
| chr1  |  934341 | 935552 | -      | NM_021170    | HES4       |        56421 |
| chr1  |  934343 | 935552 | -      | NM_001142467 | HES4       |        56421 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_032129    | PLEKHN1    |        81489 |
| chr1  |  901876 | 910484 | +      | NM_001160184 | PLEKHN1    |        81489 |
| chr1  |  895966 | 901099 | +      | NM_198317    | KLHL17     |        90874 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
| chr1  |  879582 | 894679 | -      | NM_015658    | NOC2L      |        97294 |
+-------+---------+--------+--------+--------------+------------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not transcription start site
+-------+---------+---------+--------+--------------+------------+----------------+
| chrom | txStart | txEnd   | strand | name         | geneSymbol | 991973-txStart |
+-------+---------+---------+--------+--------------+------------+----------------+
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1007125 | 1009687 | -      | NM_001205252 | RNF223     |         -15152 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1017197 | 1051736 | -      | NM_017891    | C1orf159   |         -25224 |
| chr1  | 1072396 | 1079434 | +      | NR_038869    | LOC254099  |         -80423 |
| chr1  | 1102483 | 1102578 | +      | NR_029639    | MIR200B    |        -110510 |
| chr1  | 1103242 | 1103332 | +      | NR_029834    | MIR200A    |        -111269 |
| chr1  | 1104384 | 1104467 | +      | NR_029957    | MIR429     |        -112411 |
| chr1  | 1109285 | 1133313 | +      | NM_001130045 | TTLL10     |        -117312 |
+-------+---------+---------+--------+--------------+------------+----------------+
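
Several rows in the output above appear more than once (for example NM_032129 and NM_015658). A plausible cause, stated here as an assumption rather than something verified against the tables, is that kgXref holds more than one row for the same RefSeq accession, so the join returns one row per match. A minimal sketch of dropping exact duplicate rows after the fact is shown below. It assumes the rows are tab-separated, as the mysql client prints them with its -B (batch) and -N (skip column names) options; the helper names dedup_rows and sample_rows are hypothetical, and the sample rows are copied from the output above.

```shell
#!/bin/sh
# dedup_rows (hypothetical helper): print each distinct input line once,
# keeping the order in which lines were first seen.
dedup_rows() {
    awk '!seen[$0]++'
}

# sample_rows (hypothetical helper): rows copied from the hg19 output above,
# written tab-separated as "mysql -B -N" would print them.
sample_rows() {
    printf 'chr1\t901876\t910484\t+\tNM_032129\tPLEKHN1\t81489\n'
    printf 'chr1\t901876\t910484\t+\tNM_032129\tPLEKHN1\t81489\n'
    printf 'chr1\t901876\t910484\t+\tNM_001160184\tPLEKHN1\t81489\n'
    printf 'chr1\t879582\t894679\t-\tNM_015658\tNOC2L\t97294\n'
    printf 'chr1\t879582\t894679\t-\tNM_015658\tNOC2L\t97294\n'
}

sample_rows | dedup_rows
```

In a real run, the mysql command (with -B -N added) would be piped into dedup_rows in place of sample_rows. Writing "select distinct" in the query itself should have the same effect without post-processing.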

"Nearest gene" script for ncbiRefSeq on hg38

  • This script finds the closest transcripts to a reference point or region for the ncbiRefSeq gene set on hg38.
  • The last column of the SQL output is the distance, in bp, from the reference point to each transcript. This distance is strand agnostic: we simply start from the reference point and count bp to the left or to the right until a transcript is reached. That point may be the 5' end or the 3' end, depending on strand orientation.
#!/bin/sh

# for gene set ncbiRefSeq
# given position chr1:991973-991973
# find a sample of genes upstream and downstream of this position

# Input your assembly
G=hg38
# Input the chr for reference point
C=chr1
# Input start for reference point
S=991973
# Input end for reference point
E=991973
# Input the number of nearby transcripts to output
N=10

# Any gene set can be used. If you use a different gene set, check that
# the field names are the same; they may need updating. To check this,
# go to the Table Browser, select your gene set, and click the link for
# "table schema" to see field names. Older assemblies may use the related
# kgXref table for gene alias/gene name.
# The last column is the distance from the comparison point.


echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${S} - txEnd"
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd
AS "'${S}'-txEnd" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'"
ORDER BY e.txEnd DESC limit '${N}';' $G


echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set"
echo "last column is distance from reference point to transcript, ${E} - txStart"
echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \
start site"
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart
AS "'${E}'-txStart" FROM
   ncbiRefSeq e,
   ncbiRefSeqLink j
WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}'
ORDER BY e.txStart ASC limit '${N}';' $G

This produces the output:

closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txEnd
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site
+-------+---------+--------+--------+----------------+---------+--------------+
| chrom | txStart | txEnd  | strand | name           | name    | 991973-txEnd |
+-------+---------+--------+--------+----------------+---------+--------------+
| chr1  |  975198 | 982117 | -      | NM_001291367.1 | PERM1   |         9856 |
| chr1  |  975198 | 982117 | -      | NM_001291366.1 | PERM1   |         9856 |
| chr1  |  975198 | 982093 | -      | XM_017002583.1 | PERM1   |         9880 |
| chr1  |  975198 | 982021 | -      | XM_017002584.1 | PERM1   |         9952 |
| chr1  |  975197 | 981657 | -      | XM_017002585.1 | PERM1   |        10316 |
| chr1  |  966496 | 975108 | +      | NM_032129.2    | PLEKHN1 |        16865 |
| chr1  |  966496 | 975108 | +      | NM_001160184.1 | PLEKHN1 |        16865 |
| chr1  |  965819 | 974587 | +      | XM_006710944.3 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002476.1 | PLEKHN1 |        17386 |
| chr1  |  965819 | 974587 | +      | XM_017002474.1 | PLEKHN1 |        17386 |
+-------+---------+--------+--------+----------------+---------+--------------+
closest downstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set
last column is distance from reference point to transcript, 991973 - txStart
Note: for reverse - strand items, txStart is the 3' end, not the transcription start site
+-------+---------+---------+--------+----------------+--------------+----------------+
| chrom | txStart | txEnd   | strand | name           | name         | 991973-txStart |
+-------+---------+---------+--------+----------------+--------------+----------------+
| chr1  |  998961 | 1000172 | -      | NM_021170.3    | HES4         |          -6988 |
| chr1  |  998961 | 1001052 | -      | XM_005244771.4 | HES4         |          -6988 |
| chr1  |  998963 | 1000172 | -      | NM_001142467.1 | HES4         |          -6990 |
| chr1  | 1013466 | 1014540 | +      | NM_005101.3    | ISG15        |         -21493 |
| chr1  | 1020101 | 1056119 | +      | XM_011541429.2 | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XR_946650.2    | AGRN         |         -28128 |
| chr1  | 1020101 | 1056119 | +      | XM_005244749.3 | AGRN         |         -28128 |
| chr1  | 1020122 | 1056119 | +      | NM_198576.3    | AGRN         |         -28149 |
| chr1  | 1020122 | 1056119 | +      | NM_001305275.1 | AGRN         |         -28149 |
| chr1  | 1059706 | 1066441 | +      | XR_001737601.1 | LOC100288175 |         -67733 |
+-------+---------+---------+--------+----------------+--------------+----------------+
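
Because the distance column is strand agnostic, it can help to label which end of each transcript faces the reference point. The sketch below is one way to do that under stated assumptions: the rows are tab-separated with the strand in column 4 (as printed by mysql with its -B and -N options), and the helper name label_end is hypothetical. The rule follows the notes printed by the script: for - strand items txEnd is the 5' end and txStart is the 3' end, and the reverse holds for + strand items.

```shell
#!/bin/sh
# label_end (hypothetical helper): append a column naming the transcript end
# that faces the reference point.
#   $1 = "upstream"   : rows were matched on txEnd
#   $1 = "downstream" : rows were matched on txStart
# Input: tab-separated rows with the strand in column 4.
label_end() {
    awk -F '\t' -v side="$1" 'BEGIN { OFS = "\t" }
    {
        # On the + strand txStart is the 5-prime end; on the - strand txEnd is.
        if (side == "upstream") { label = ($4 == "+") ? "3prime" : "5prime" }
        else                    { label = ($4 == "+") ? "5prime" : "3prime" }
        print $0, label
    }'
}

# Two rows copied from the hg38 output above.
printf 'chr1\t975198\t982117\t-\tNM_001291367.1\tPERM1\t9856\n' | label_end upstream
printf 'chr1\t1013466\t1014540\t+\tNM_005101.3\tISG15\t-21493\n' | label_end downstream
```

Both sample rows are labelled 5prime: PERM1 is on the - strand and was reached at its txEnd, and ISG15 is on the + strand and was reached at its txStart. In both cases the reference point faces the transcription start site.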