Finding nearby genes: Difference between revisions
From genomewiki
Jump to navigationJump to search
Cath Tyner (talk | contribs) mNo edit summary |
Cath Tyner (talk | contribs) mNo edit summary |
||
Line 65: | Line 65: | ||
+------+--------+--------+------------+----------+ | +------+--------+--------+------------+----------+ | ||
</PRE> | </PRE> | ||
===Script for refGene on hg19=== | |||
Here is a script for the gene set refGene on hg19. | |||
[http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema View the table schema for refGene, hg19] | |||
<PRE> | |||
#!/bin/sh | |||
# for gene set refGene | |||
# given position chr1:991973-991973 | |||
# find a sample of genes near this upstream and downstream | |||
# Input your assembly | |||
G=hg19 | |||
# Input the chr for reference point | |||
C=chr1 | |||
# Input start for reference point | |||
S=991973 | |||
# Input end for reference point | |||
E=991973 | |||
# Input the number of nearby transcripts to output | |||
N=10 | |||
# This script uses the gene set refGene. | |||
# Any gene set can be used. If a different gene set is used, check that | |||
# the field names are the same, they may need updating. To check this, | |||
# go to the Table Browser, select your gene set, and click the link for | |||
# "table schema" to see field names. Older assemblies may use the related | |||
# kgXref table for gene alias/gene name. | |||
# The last column is the distance from the comparison point. | |||
echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" | |||
echo "last column is distance from reference point to transcript, ${S} - txEnd" | |||
echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ | |||
start site" | |||
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ | |||
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd | |||
AS "'${S}'-txEnd" FROM | |||
refGene e, | |||
kgXref j | |||
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" | |||
ORDER BY e.txEnd DESC limit 10;' $G | |||
echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" | |||
echo "last column is distance from reference point to transcript, ${E} - txStart" | |||
echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \ | |||
start site" | |||
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ | |||
'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart | |||
AS "'${E}'-txStart" FROM | |||
refGene e, | |||
kgXref j | |||
WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}' | |||
ORDER BY e.txStart ASC limit 10;' $G | |||
</PRE> | |||
<PRE> | |||
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set | |||
last column is distance from reference point to transcript, 991973 - txEnd | |||
Note: for reverse - strand items, txEnd is the 5' end, the transcription start site | |||
+-------+---------+--------+--------+--------------+------------+--------------+ | |||
| chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txEnd | | |||
+-------+---------+--------+--------+--------------+------------+--------------+ | |||
| chr1 | 955502 | 991499 | + | NM_198576 | AGRN | 474 | | |||
| chr1 | 948846 | 949919 | + | NM_005101 | ISG15 | 42054 | | |||
| chr1 | 934341 | 935552 | - | NM_021170 | HES4 | 56421 | | |||
| chr1 | 934343 | 935552 | - | NM_001142467 | HES4 | 56421 | | |||
| chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | |||
| chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | |||
| chr1 | 901876 | 910484 | + | NM_001160184 | PLEKHN1 | 81489 | | |||
| chr1 | 895966 | 901099 | + | NM_198317 | KLHL17 | 90874 | | |||
| chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | | |||
| chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | | |||
+-------+---------+--------+--------+--------------+------------+--------------+ | |||
closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set | |||
last column is distance from reference point to transcript, 991973 - txStart | |||
Note: for reverse - strand items, txStart is the 3' end, not transcription start site | |||
+-------+---------+---------+--------+--------------+------------+----------------+ | |||
| chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txStart | | |||
+-------+---------+---------+--------+--------------+------------+----------------+ | |||
| chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | |||
| chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | |||
| chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | |||
| chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | |||
| chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | |||
| chr1 | 1072396 | 1079434 | + | NR_038869 | LOC254099 | -80423 | | |||
| chr1 | 1102483 | 1102578 | + | NR_029639 | MIR200B | -110510 | | |||
| chr1 | 1103242 | 1103332 | + | NR_029834 | MIR200A | -111269 | | |||
| chr1 | 1104384 | 1104467 | + | NR_029957 | MIR429 | -112411 | | |||
| chr1 | 1109285 | 1133313 | + | NM_001130045 | TTLL10 | -117312 | | |||
+-------+---------+---------+--------+--------------+------------+----------------+ | |||
</PRE> | |||
===Script for ncbiRefSeq on hg38=== | ===Script for ncbiRefSeq on hg38=== | ||
Line 162: | Line 257: | ||
</PRE> | </PRE> | ||
[[Category:Technical FAQ]] | [[Category:Technical FAQ]] | ||
[[Category:User Developed Scripts]] | [[Category:User Developed Scripts]] |
Revision as of 19:22, 26 May 2017
Introduction
If you are interested in a certain genomic position, or reference point, and you want to find a sample of nearby genes upstream and downstream from this position, you can create a script by copying one of the examples below.
- Open your editor on the command line and create a script in your bin directory.
- E.g.,
vi closestGene.sh
- Paste one of the scripts below into your closestGene.sh file
- Run the script.
closestGene.sh
- Scripts assume that you have MySQL access. See more about the UCSC Genome Browser MySQL database.
Alternatives
- Galaxy has a "Fetch closest non-overlapping feature" tool
- the BedTools include a tool "closestBed"
Script for knownGene on hg18
View table schema for knownGene, hg18
#!/bin/sh # given position chr1:710000-720000 # find a sample of genes near this upstream and downstream C=chr1 S=710000 E=720000 echo "three upstream genes from ${C}:${S}-${E}" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \ 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txEnd < '${S}' ORDER BY e.txEnd DESC limit 3;' hg18 echo "three downstream genes from ${C}:${S}-${E}" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -e \ 'select e.chrom,e.txStart,e.txEnd,e.alignID,j.geneSymbol FROM knownGene e, kgXref j WHERE e.alignID = j.kgID AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit 3;' hg18
This produces the output:
three upstream genes from chr1:710000-720000 +------+--------+--------+------------+----------+ | chr1 | 690107 | 703869 | uc001abo.1 | BC006361 | | chr1 | 665195 | 665226 | uc001abn.1 | DQ599872 | | chr1 | 665086 | 665147 | uc001abm.1 | DQ600587 | +------+--------+--------+------------+----------+ three downstream genes from chr1:710000-720000 +------+--------+--------+------------+----------+ | chr1 | 752926 | 778860 | uc001abp.1 | BC102012 | | chr1 | 752926 | 778860 | uc001abq.1 | BC042880 | | chr1 | 752926 | 779603 | uc001abr.1 | CR601056 | +------+--------+--------+------------+----------+
Script for refGene on hg19
Here is a script for the gene set refGene on hg19.
View the table schema for refGene, hg19
#!/bin/sh # for gene set refGene # given position chr1:991973-991973 # find a sample of genes near this upstream and downstream # Input your assembly G=hg19 # Input the chr for reference point C=chr1 # Input start for reference point S=991973 # Input end for reference point E=991973 # Input the number of nearby transcripts to output N=10 # This script uses the gene set refGene. # Any gene set can be used. If a different gene set is used, check that # the field names are the same, they may need updating. To check this, # go to the Table Browser, select your gene set, and click the link for # "table schema" to see field names. Older assemblies may use the related # kgXref table for gene alias/gene name. # The last column is the distance from the comparison point. echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" echo "last column is distance from reference point to transcript, ${S} - txEnd" echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${S}'" - e.txEnd AS "'${S}'-txEnd" FROM refGene e, kgXref j WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" ORDER BY e.txEnd DESC limit 10;' $G echo "closest downstream transcripts from ${C}:${S}-${E} in ${G} for refGene set" echo "last column is distance from reference point to transcript, ${E} - txStart" echo "Note: for reverse - strand items, txStart is the 3' end, not transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.geneSymbol,"'${E}'" - e.txStart AS "'${E}'-txStart" FROM refGene e, kgXref j WHERE e.name = j.refseq AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit 10;' $G
closest upstream transcripts from chr1:991973-991973 in hg19 for refGene set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txEnd is the 5' end, the transcription start site +-------+---------+--------+--------+--------------+------------+--------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txEnd | +-------+---------+--------+--------+--------------+------------+--------------+ | chr1 | 955502 | 991499 | + | NM_198576 | AGRN | 474 | | chr1 | 948846 | 949919 | + | NM_005101 | ISG15 | 42054 | | chr1 | 934341 | 935552 | - | NM_021170 | HES4 | 56421 | | chr1 | 934343 | 935552 | - | NM_001142467 | HES4 | 56421 | | chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | chr1 | 901876 | 910484 | + | NM_032129 | PLEKHN1 | 81489 | | chr1 | 901876 | 910484 | + | NM_001160184 | PLEKHN1 | 81489 | | chr1 | 895966 | 901099 | + | NM_198317 | KLHL17 | 90874 | | chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | | chr1 | 879582 | 894679 | - | NM_015658 | NOC2L | 97294 | +-------+---------+--------+--------+--------------+------------+--------------+ closest downstream transcripts from chr1:991973-991973 in hg19 for refGene set last column is distance from reference point to transcript, 991973 - txStart Note: for reverse - strand items, txStart is the 3' end, not transcription start site +-------+---------+---------+--------+--------------+------------+----------------+ | chrom | txStart | txEnd | strand | name | geneSymbol | 991973-txStart | +-------+---------+---------+--------+--------------+------------+----------------+ | chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | chr1 | 1007125 | 1009687 | - | NM_001205252 | RNF223 | -15152 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1017197 | 1051736 | - | NM_017891 | C1orf159 | -25224 | | chr1 | 1072396 | 1079434 | + | NR_038869 | LOC254099 | -80423 | | chr1 | 1102483 | 1102578 | + | NR_029639 | MIR200B | -110510 | | chr1 | 1103242 | 1103332 | + | NR_029834 | MIR200A | -111269 | | chr1 | 1104384 | 1104467 | + | NR_029957 | MIR429 | -112411 | | chr1 | 1109285 | 1133313 | + | NM_001130045 | TTLL10 | -117312 | +-------+---------+---------+--------+--------------+------------+----------------+
Script for ncbiRefSeq on hg38
Here is a script for the gene set ncbiRefSeq on hg38.
View table schema for ncbiRefSeq, hg38
#!/bin/sh # for gene set ncbiRefSeq # given position chr1:991973-991973 # find a sample of genes near this upstream and downstream # Input your assembly G=hg38 # Input the chr for reference point C=chr1 # Input start for reference point S=991973 # Input end for reference point E=991973 # Input the number of nearby transcripts to output N=10 # Any gene set can be used. If a different gene set is used, check that # the field names are the same, they may need updating. To check this, # go to the Table Browser, select your gene set, and click the link for # "table schema" to see field names. Older assemblies may use the related # kgXref table for gene alias/gene name. # The last column is the distance from the comparison point. echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set" echo "last column is distance from reference point to transcript, ${S} - txEnd" echo "Note: for reverse - strand items, txEnd is the 5' end, the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${S}'" - e.txEnd AS "'${S}'-txEnd" FROM ncbiRefSeq e, ncbiRefSeqLink j WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txEnd < "'${S}'" ORDER BY e.txEnd DESC limit '${N}';' $G echo "closest upstream transcripts from ${C}:${S}-${E} in ${G} for ncbiRefSeq set" echo "last column is distance from reference point to transcript, ${E} - txEnd" echo "Note: for reverse - strand items, txStart is the 3' end, not the transcription \ start site" mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e \ 'select e.chrom,e.txStart,e.txEnd,e.strand,e.name,j.name,"'${E}'" - e.txStart AS "'${E}'-txStart" FROM ncbiRefSeq e, ncbiRefSeqLink j WHERE e.name = j.id AND e.chrom="'${C}'" AND e.txStart > '${E}' ORDER BY e.txStart ASC limit '${N}';' $G
This produces the output:
closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txEnd is the 5' end, the transcription start site +-------+---------+--------+--------+----------------+---------+--------------+ | chrom | txStart | txEnd | strand | name | name | 991973-txEnd | +-------+---------+--------+--------+----------------+---------+--------------+ | chr1 | 975198 | 982117 | - | NM_001291367.1 | PERM1 | 9856 | | chr1 | 975198 | 982117 | - | NM_001291366.1 | PERM1 | 9856 | | chr1 | 975198 | 982093 | - | XM_017002583.1 | PERM1 | 9880 | | chr1 | 975198 | 982021 | - | XM_017002584.1 | PERM1 | 9952 | | chr1 | 975197 | 981657 | - | XM_017002585.1 | PERM1 | 10316 | | chr1 | 966496 | 975108 | + | NM_032129.2 | PLEKHN1 | 16865 | | chr1 | 966496 | 975108 | + | NM_001160184.1 | PLEKHN1 | 16865 | | chr1 | 965819 | 974587 | + | XM_006710944.3 | PLEKHN1 | 17386 | | chr1 | 965819 | 974587 | + | XM_017002476.1 | PLEKHN1 | 17386 | | chr1 | 965819 | 974587 | + | XM_017002474.1 | PLEKHN1 | 17386 | +-------+---------+--------+--------+----------------+---------+--------------+ closest upstream transcripts from chr1:991973-991973 in hg38 for ncbiRefSeq set last column is distance from reference point to transcript, 991973 - txEnd Note: for reverse - strand items, txStart is the 3' end, not the transcription start site +-------+---------+---------+--------+----------------+--------------+----------------+ | chrom | txStart | txEnd | strand | name | name | 991973-txStart | +-------+---------+---------+--------+----------------+--------------+----------------+ | chr1 | 998961 | 1000172 | - | NM_021170.3 | HES4 | -6988 | | chr1 | 998961 | 1001052 | - | XM_005244771.4 | HES4 | -6988 | | chr1 | 998963 | 1000172 | - | NM_001142467.1 | HES4 | -6990 | | chr1 | 1013466 | 1014540 | + | NM_005101.3 | ISG15 | -21493 | | chr1 | 1020101 | 1056119 | + | XM_011541429.2 | AGRN | -28128 | | chr1 | 1020101 | 1056119 | + | XR_946650.2 | AGRN | -28128 | | chr1 | 1020101 | 1056119 | + | XM_005244749.3 | AGRN | -28128 | | chr1 | 1020122 | 1056119 | + | NM_198576.3 | AGRN | -28149 | | chr1 | 1020122 | 1056119 | + | NM_001305275.1 | AGRN | -28149 | | chr1 | 1059706 | 1066441 | + | XR_001737601.1 | LOC100288175 | -67733 | +-------+---------+---------+--------+----------------+--------------+----------------+