Building a new genome database: Difference between revisions

Revision as of 17:55, 18 August 2008

Caveat: This is a quick first pass (2008-07-23). There could be errors. For reference, check our automation tool that does all this in the source tree: ~/kent/src/hg/utils/automation/makeGenomeDb.pl

Prerequisites: You have the kent source tree checked out in your home directory at ~/kent/src/, and you are familiar with the contents of the README files in ~/kent/src/product/README.*
You have also built all of the utilities (see those README files ...)

1. Organize your work. Decide on a database name. UCSC bases its database names on binomial nomenclature. The UCSC naming scheme is abcDef1, where abc is the first three letters of the genus, Def is the first three letters of the species, and 1 is the version of the genome. Versions start at 1. At UCSC, /cluster/data/abcDef1 is a symlink to a filesystem that has enough space to contain the build. Thus, all genome builds can be found under /cluster/data/ regardless of their actual NFS filesystem location. For this discussion, the new genome name will simply be abcDef1. A few files are kept in /cluster/data/abcDef1/, but most work files for tracks are kept in /cluster/data/abcDef1/bed/trackName/
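
The naming scheme above can be sketched as a small shell function. This is only an illustration: make_db_name is a hypothetical helper, not part of the kent tree, and "Abcus definitus" is a made-up binomial chosen to produce the abcDef1 name used on this page.

```shell
# Hypothetical helper (not a kent utility): build an abcDef1-style name
# from a genus, a species, and an optional version (defaults to 1).
make_db_name() {
    genus=$1
    species=$2
    version=${3:-1}
    # First three letters of the genus, lowercased.
    g=$(printf '%s' "$genus" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    # First three letters of the species, with the first one capitalized.
    s1=$(printf '%s' "$species" | cut -c1 | tr '[:lower:]' '[:upper:]')
    s2=$(printf '%s' "$species" | cut -c2-3 | tr '[:upper:]' '[:lower:]')
    printf '%s%s%s%s\n' "$g" "$s1" "$s2" "$version"
}

make_db_name Abcus definitus 1    # prints abcDef1
```

The same rule gives ornAna1 for Ornithorhynchus anatinus. Some long-established database names do not follow the scheme, so treat the function as a starting point rather than a rule.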

2. You should have fasta file(s) of your genome sequence and an AGP file describing how they are assembled from contigs into scaffolds, from scaffolds into chromosomes, or some combination thereof. If no AGP file exists, one can be constructed purely from the fasta files (hgFakeAgp). Assemblers usually produce an AGP file.

3. Convert your fasta to 2bit format:

   $ faToTwoBit newGenome.fa abcDef1.2bit
   $ mkdir /gbdb/abcDef1
   $ mkdir /gbdb/abcDef1/html
   $ ln -s `pwd`/abcDef1.2bit /gbdb/abcDef1/abcDef1.2bit

4. Verify that your AGP file matches your fasta file:

   $ sort -k1,1 -k2n,2n original.agp > abcDef1.agp
   $ checkAgpAndFa abcDef1.agp abcDef1.2bit
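
As a rough illustration of one thing such a check involves, the sketch below compares the largest end coordinate of each object in an AGP file against a table of sequence sizes. It is not a replacement for checkAgpAndFa and does not reproduce its logic. The function name check_agp_sizes and the sample data are made up for this example. It assumes a tab-separated AGP file with the object name in column 1 and the object end in column 3, and a two-column name/size file.

```shell
# Made-up sample data, for illustration only.
workdir=$(mktemp -d)
{
    printf 'chr1\t1\t1000\t1\tW\tcontig1\t1\t1000\t+\n'
    printf 'chr1\t1001\t1100\t2\tN\t100\tcontig\tno\n'
    printf 'chr1\t1101\t2000\t3\tW\tcontig2\t1\t900\t+\n'
} > "$workdir/sample.agp"
printf 'chr1\t2000\n' > "$workdir/sample.sizes"

# Hypothetical helper: check_agp_sizes sizesFile agpFile
# Exits non-zero if any sequence length differs from the largest
# object end coordinate recorded for it in the AGP file.
check_agp_sizes() {
    awk -F'\t' '
        NR == FNR { size[$1] = $2 + 0; next }      # first file: name, size
        /^#/      { next }                         # skip AGP comment lines
        $3 + 0 > last[$1] + 0 { last[$1] = $3 + 0 }
        END {
            bad = 0
            for (c in size) {
                if (last[c] + 0 != size[c]) {
                    printf "mismatch %s: agp=%d fa=%d\n", c, last[c], size[c]
                    bad = 1
                }
            }
            exit bad
        }' "$1" "$2"
}

check_agp_sizes "$workdir/sample.sizes" "$workdir/sample.agp" && echo "sizes agree"
```

A sizes file of this shape is what the twoBitInfo command in the next step writes to chrom.sizes.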

5. Create a chromInfo file:

   $ twoBitInfo abcDef1.2bit stdout | sort -k2nr > chrom.sizes
   $ mkdir -p bed/chromInfo
   $ awk '{printf "%s\t%d\t/gbdb/abcDef1/abcDef1.2bit\n", $1, $2}' \
          chrom.sizes > bed/chromInfo/chromInfo.tab
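
To see what the awk command writes, here is the same transformation applied to two made-up chrom.sizes lines. The wrapper function make_chrom_info is a hypothetical convenience that takes the database name as a parameter; it is not a kent utility.

```shell
# Hypothetical wrapper around the awk command above: reads name/size
# lines on stdin and appends the 2bit path for the given database.
make_chrom_info() {
    awk -v db="$1" '{printf "%s\t%d\t/gbdb/%s/%s.2bit\n", $1, $2, db, db}'
}

# Made-up sizes, for illustration only.
printf 'chr1\t2000\nchr2\t1500\n' | make_chrom_info abcDef1
```

Each output line has three tab-separated fields: the sequence name, its size, and the path to the 2bit file, for example chr1, 2000, /gbdb/abcDef1/abcDef1.2bit.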

6. Start your new database:

   $ hgsql -e "create database abcDef1;" mysql

7. Load the grp table:

   $ hgsql abcDef1 < $HOME/kent/src/hg/lib/grp.sql

8. Load the chromInfo table:

   $ hgLoadSqlTab abcDef1 chromInfo $HOME/kent/src/hg/lib/chromInfo.sql \
             bed/chromInfo/chromInfo.tab

9. Load the gold and gap tables from your AGP file:

   $ hgGoldGapGl abcDef1 abcDef1.agp

10. Generate the gc5Base data and load table:

   $ mkdir bed/gc5Base
   $ hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 abcDef1 \
                 abcDef1.2bit | wigEncode stdin bed/gc5Base/gc5Base.{wig,wib}
   $ hgLoadWiggle -pathPrefix=/gbdb/abcDef1/wib \
                 abcDef1 gc5Base bed/gc5Base/gc5Base.wig
   $ mkdir /gbdb/abcDef1/wib
   $ ln -s `pwd`/bed/gc5Base/gc5Base.wib /gbdb/abcDef1/wib

11. Create the dbDb SQL insert statement. Choose the orderKey by looking at the existing dbDb entries; it determines the order of the pulldown menus on the gateway page. Place the statement in a file, dbDbInsert.sql, and load it with the command: hgsql hgcentral < dbDbInsert.sql

INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName)
VALUES
    ("abcDef1", "July 2008", "/gbdb/abcDef1", "A. organism",
     "chr1:10459784-10469783", 1, 123, "A. organism", "Genus species",
     "/gbdb/abcDef1/html/description.html", 0, 0, "new genome version 1.0");
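
One possible way to produce dbDbInsert.sql is a shell here-document, so that the database name and orderKey are set in one place. This is a sketch only: the variable names are chosen for this example, and the values are the same placeholders used in the statement above.

```shell
db=abcDef1                           # database name used throughout this page
orderKey=123                         # placeholder; choose yours from existing dbDb entries
sqlFile=${sqlFile:-dbDbInsert.sql}   # output file, in the current directory by default

# Write the insert statement with the variables filled in.
cat > "$sqlFile" <<EOF
INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName)
VALUES
    ("$db", "July 2008", "/gbdb/$db", "A. organism",
     "chr1:10459784-10469783", 1, $orderKey, "A. organism", "Genus species",
     "/gbdb/$db/html/description.html", 0, 0, "new genome version 1.0");
EOF

# To look at the existing keys before choosing one (needs access to hgcentral):
#   hgsql -e "select name, orderKey from dbDb order by orderKey;" hgcentral
# To load the statement:
#   hgsql hgcentral < dbDbInsert.sql
```

The hgsql commands are left as comments so that the snippet only writes the file; run them yourself once the values have been checked.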

12. Make a trackDb hierarchy for your genome. Populate it with trackDb.ra files and a description.html file. See existing examples in the source tree ~/kent/src/hg/makeDb/trackDb/. Your hierarchy does not have to be in the source tree. You can load it into a separate database with the hgTrackDb and hgFindSpec commands, then reference this extra trackDb database through the options in your cgi-bin/hg.conf.
