Building a new genome database
Latest revision as of 23:56, 12 February 2020
Prerequisites: you have the kent source tree checked out in your home directory at ~/kent/src/
and you are familiar with the contents of the README files in ~/kent/src/product/README.*
You have built all of the utilities (see those README files in src/product):
- Executable and Source Code Downloads -> http://hgdownload.soe.ucsc.edu/downloads.html#source_downloads
- Utilities, README -> http://genome-source.soe.ucsc.edu/gitlist/kent.git/blob/master/src/userApps/README
1. Organize your work. Decide on a database name. UCSC bases its names on binomial nomenclature. The UCSC naming scheme is abcDef1, where abc is the first three letters of the genus, Def is the first three letters of the species, and 1 is the version of the genome. Versions start at 1. At UCSC we use a symlink from /cluster/data/abcDef1 to a filesystem that has enough space to contain the build. Thus, all genome builds can be found under /cluster/data/ regardless of their actual NFS filesystem location. For this discussion, our new genome name will be simply: abcDef1. A few files are kept in /cluster/data/abcDef1/ but most work files for tracks are kept in /cluster/data/abcDef1/bed/trackName/
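The naming rule can be sketched as a small shell function. This is only an illustration: make_db_name is a hypothetical helper, not one of the kent utilities, and it applies the rule mechanically without checking whether the resulting name is already in use.

```shell
# Hypothetical helper (not part of the kent tools): build an abcDef1-style
# database name from genus, species, and assembly version.
make_db_name() {
    genus=$1
    species=$2
    version=${3:-1}   # versions start at 1
    # first three letters of the genus, lowercased
    abc=$(printf '%s' "$genus" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    # first three letters of the species, with the first letter capitalized
    d=$(printf '%s' "$species" | cut -c1 | tr '[:lower:]' '[:upper:]')
    ef=$(printf '%s' "$species" | cut -c2-3 | tr '[:upper:]' '[:lower:]')
    printf '%s%s%s%s\n' "$abc" "$d" "$ef" "$version"
}

make_db_name Gallus gallus 1    # prints galGal1
```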
2. You should have fasta file(s) of your genome sequence and an AGP file describing their construction from contigs into scaffolds, or scaffolds into chromosomes, or combinations thereof. Assemblers usually produce an AGP file. If no AGP file exists, one can be constructed from the fasta files alone with hgFakeAgp. To mark all N's as gaps in your fake AGP:
hgFakeAgp -minContigGap=1 newGenome.fa abcDef1.agp
3. Convert your fasta to 2bit format:
$ faToTwoBit newGenome.fa abcDef1.2bit
$ mkdir /gbdb/abcDef1
$ mkdir /gbdb/abcDef1/html
$ ln -s `pwd`/abcDef1.2bit /gbdb/abcDef1/abcDef1.2bit
4. Verify that your AGP file matches your fasta file:
$ sort -k1,1 -k2n,2n original.agp > abcDef1.agp
$ checkAgpAndFa abcDef1.agp abcDef1.2bit
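Before running checkAgpAndFa, a quick structural pre-check of the sorted AGP file can catch obvious coordinate problems. The sketch below assumes the standard tab-separated AGP layout (object name, object start, and object end in the first three columns, comment lines starting with #). check_agp_contiguous is a hypothetical helper and does not replace checkAgpAndFa, which compares the AGP against the sequence itself.

```shell
# Hypothetical pre-check (not a kent utility): within each object of a
# sorted AGP file, rows should start at 1 and abut with no gaps or overlaps.
check_agp_contiguous() {
    awk -F'\t' '
        /^#/ { next }                         # skip AGP comment lines
        $1 != obj { obj = $1; expected = 1 }  # new object: coordinates restart at 1
        $2 != expected {
            printf "%s: expected start %d, found %d\n", $1, expected, $2
            bad = 1
        }
        { expected = $3 + 1 }                 # next row must begin right after this one
        END { exit bad + 0 }
    ' "$1"
}
```

Usage would be along the lines of: check_agp_contiguous abcDef1.agp && echo "coordinates look contiguous"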
5. Create a chromInfo file:
$ twoBitInfo abcDef1.2bit stdout | sort -k2nr > chrom.sizes
$ mkdir -p bed/chromInfo
$ awk '{printf "%s\t%d\t/gbdb/abcDef1/abcDef1.2bit\n", $1, $2}' \
    chrom.sizes > bed/chromInfo/chromInfo.tab
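The resulting chromInfo.tab should have three tab-separated columns per line: sequence name, size, and the 2bit path. A minimal sanity check, assuming exactly that layout, might look like the following. check_chrom_info is a hypothetical helper, not a kent utility.

```shell
# Hypothetical sanity check for chromInfo.tab: every line needs three
# tab-separated fields and a positive integer size in the second field.
check_chrom_info() {
    awk -F'\t' '
        NF != 3 || $2 !~ /^[0-9]+$/ || $2 == 0 {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
            next
        }
        { n++; total += $2 }
        END {
            # %.0f avoids integer overflow in some awk versions on large genomes
            printf "%d sequences, %.0f bases\n", n, total
            exit bad + 0
        }
    ' "$1"
}
```

Running check_chrom_info bed/chromInfo/chromInfo.tab before loading the table gives a quick count to compare against what you expect from the assembly.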
6. Start your new database:
(You may need to update your MySQL database permissions to allow access to your new database.)
$ hgsql -e "create database abcDef1;" mysql
7. Load the grp table:
$ hgsql abcDef1 < $HOME/kent/src/hg/lib/grp.sql
8. Load the chromInfo table:
$ hgLoadSqlTab abcDef1 chromInfo $HOME/kent/src/hg/lib/chromInfo.sql \
    bed/chromInfo/chromInfo.tab
9. Load the gold and gap tables from your AGP file:
$ hgGoldGapGl abcDef1 abcDef1.agp
10. Generate the gc5Base data and load table:
$ mkdir bed/gc5Base
$ cd bed/gc5Base
$ hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 abcDef1 \
    ../../abcDef1.2bit | wigEncode stdin gc5Base.{wig,wib}
$ hgLoadWiggle -pathPrefix=/gbdb/abcDef1/wib \
    abcDef1 gc5Base gc5Base.wig
$ mkdir /gbdb/abcDef1/wib
$ ln -s `pwd`/gc5Base.wib /gbdb/abcDef1/wib
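To make the quantity behind the gc5Base track concrete, here is a toy calculation of percent G+C over consecutive 5-base windows of a short sequence. It is an illustration only: gc5_windows is a hypothetical function, the 1-based start column is a choice made for this sketch, and hgGcPercent's own handling of gaps, partial windows, and output format is not reproduced.

```shell
# Illustration only: percent G+C in consecutive, non-overlapping 5-base
# windows of one sequence string. Prints "start<TAB>percent" per window.
gc5_windows() {
    printf '%s\n' "$1" | awk '
        {
            seq = toupper($0)
            for (i = 1; i + 4 <= length(seq); i += 5) {
                win = substr(seq, i, 5)
                gc = gsub(/[GC]/, "", win)   # gsub returns how many G/C bases it replaced
                printf "%d\t%d\n", i, gc * 100 / 5
            }
        }'
}

gc5_windows GCGCGATATAGGATC
```

For this input the three windows are GCGCG, ATATA, and GGATC, giving 100, 0, and 60 percent.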
11. Create the dbDb SQL insert statement. Choose the orderKey by looking at existing dbDb entries; it determines the order of the pulldown menus on the gateway page. Place the statement in a file, dbDbInsert.sql, and load it with the command: hgsql hgcentral < dbDbInsert.sql
INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName, taxId)
VALUES
    ("abcDef1", "July 2008", "/gbdb/abcDef1", "A. organism",
     "chr1:10459784-10469783", 1, 123, "A. organism", "Genus species",
     "/gbdb/abcDef1/html/description.html", 0, 0, "new genome version 1.0", 12345);
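If you build several assemblies, the statement can be generated from a few parameters rather than edited by hand. The sketch below is one way to do that. make_dbdb_insert is a hypothetical helper, its argument order is arbitrary, and it does no escaping, so values containing double quotes would need extra care.

```shell
# Hypothetical helper: print a dbDb INSERT statement for one assembly.
# Arguments: 1 db name, 2 description (date), 3 organism/genome label,
#            4 scientific name, 5 default position, 6 orderKey,
#            7 source name, 8 taxId.
make_dbdb_insert() {
    cat <<EOF
INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName, taxId)
VALUES
    ("$1", "$2", "/gbdb/$1", "$3",
     "$5", 1, $6, "$3", "$4",
     "/gbdb/$1/html/description.html", 0, 0, "$7", $8);
EOF
}

# Reproduces the example statement on stdout.
make_dbdb_insert abcDef1 "July 2008" "A. organism" "Genus species" \
    "chr1:10459784-10469783" 123 "new genome version 1.0" 12345
```

Redirecting the output to dbDbInsert.sql gives the file that the hgsql command in this step loads.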
12. Create defaultDb and genomeClade table SQL entries. For example:
hgsql hgcentral -e 'INSERT INTO defaultDb (genome, name) VALUES ("A. organism", "abcDef1");'
hgsql hgcentral -e 'INSERT INTO genomeClade (genome, clade, priority) VALUES ("A. organism", "vertebrate", 123);'
The genomeClade.priority value helps choose the default genome for that clade. See examples of these values in the existing defaultDb and genomeClade tables. You can verify the hgcentral table relationships with a join on these tables:
hgsql -e "SELECT d.name,d.orderKey,g.genome,g.priority,g.clade,d.scientificName FROM
    dbDb d, genomeClade g
    WHERE d.organism = g.genome
    ORDER BY d.orderKey;" hgcentral
13. Make a trackDb hierarchy for your genome. Populate it with trackDb.ra files and a description.html file. See existing examples in the source tree ~/kent/src/hg/makeDb/trackDb/. Your hierarchy does not have to be in the source tree. You can load it into a separate database with the hgTrackDb and hgFindSpec commands, and then reference that extra trackDb database through the options in your cgi-bin/hg.conf. The following page has information on trackDb files, which are identical for hubs and native assembly tracks:
https://genome.ucsc.edu/goldenPath/help/hubQuickStart.html
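As a starting point, a trackDb.ra file for the gc5Base table loaded in step 10 might contain a single stanza like the one written by the sketch below. The labels and the choice of group are assumptions for illustration: the stanza assumes a group named map exists in your grp table, and write_gc5base_stanza is a hypothetical helper. Compare against the existing trackDb.ra files in the source tree before relying on it.

```shell
# Hypothetical helper: write a minimal trackDb.ra with one stanza for the
# gc5Base wiggle table into the given directory.
write_gc5base_stanza() {
    mkdir -p "$1"
    cat > "$1/trackDb.ra" <<'EOF'
track gc5Base
shortLabel GC Percent
longLabel GC Percent in 5-Base Windows
group map
visibility hide
type wig 0 100
EOF
}
```

For example, write_gc5base_stanza myTrackDb/abcDef1 would create myTrackDb/abcDef1/trackDb.ra, where myTrackDb is a placeholder for wherever you keep your trackDb hierarchy.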