Fetch fasta sequence for identifier list: Difference between revisions
From genomewiki
Jump to navigationJump to search
No edit summary |
(changed cse to soe) |
||
(2 intermediate revisions by one other user not shown) | |||
Line 12: | Line 12: | ||
cat accessionList.txt | while read A | cat accessionList.txt | while read A | ||
do | do | ||
mysql --user=genome --host=genome-mysql. | mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N \ | ||
-e 'select chrom,txStart,txEnd,name,0,strand | -e 'select chrom,txStart,txEnd,name,0,strand | ||
from refGene where name ="'${A}'";' hg18 | from refGene where name ="'${A}'";' hg18 | ||
Line 45: | Line 45: | ||
faRc -keepName -keepCase reverse.fa rev_compliment.fa | faRc -keepName -keepCase reverse.fa rev_compliment.fa | ||
</PRE> | </PRE> | ||
[[Category:Technical FAQ]] | |||
[[Category:User Developed Scripts]] |
Latest revision as of 07:56, 1 September 2018
The following script demonstrates the use of the public MySQL database to extract the information for a list of identifiers. Then using twoBitToFa with the hg18.2bit file to extract the sequence for that list of identifiers. And faRc to reverse complement the items that are on the negative strand. This can be used as an alternative to a table browser query when the list of identifiers is too long for the table browser to successfully complete the operation. Adjust the table used if the identifiers you are working with are from some other type of table rather than refGene.
#!/bin/sh ## script used to extract fasta sequence from list of identifiers ## in the refGene table ## given an accession list of identifiers known to be in the refGene table ## fetch the transcript coordinates from the refGene table ## score is zero for all items cat accessionList.txt | while read A do mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N \ -e 'select chrom,txStart,txEnd,name,0,strand from refGene where name ="'${A}'";' hg18 done > accessionList.bed 2> sqlErrors.txt ## This does not capture any sqlErrors when the name is not found ? ## separate out the reverse and forward strand items, reformat to ## match twoBitToFa format chrN:start-stop, and with name included awk -F'\t' '/\+$/ {printf "%s:%d-%d\t%s\n", $1,$2,$3,$4}' accessionList.bed \ > forwardStrand.bed awk -F'\t' '/-$/ {printf "%s:%d-%d\t%s\n", $1,$2,$3,$4}' accessionList.bed \ > reverseStrand.bed ## fetch each sequence from the 2bit file, insert name in fasta output ## forward strand items cat forwardStrand.bed | while read N do POSITION=`echo "${N}" | cut -f1` NAME=`echo "${N}" | cut -f2` twoBitToFa hg18.2bit:${POSITION} stdout | sed -e "s/^>/>${NAME} /" done > forward.fa ## reverse strand items cat reverseStrand.bed | while read N do POSITION=`echo "${N}" | cut -f1` NAME=`echo "${N}" | cut -f2` twoBitToFa hg18.2bit:${POSITION} stdout | sed -e "s/^>/>${NAME} /" done > reverse.fa ## and finally, reverse compliment the reverse strand sequence faRc -keepName -keepCase reverse.fa rev_compliment.fa