Ensembl minimum install: Difference between revisions

From genomewiki
Jump to navigationJump to search
No edit summary
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
Necessary files for this example, see [[Image:EnsemblWorkshopFiles.tar.gz]]. The examples are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on [http://www.voeglein.de/gi_workshop_2008/Exercise%201.doc Jan Vogel's homepage] and in the tar archive referenced above, in the doc/ directory.
This example uses some demo files prepared by Jan Vogel and Bronwen Aken. Download and extract them from [[Image:EnsemblWorkshopFiles.tar.gz]], then run your tcsh and do "source set_perl_path.sh".  


For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their AGP files (which maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises.
The example commands on this page are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on [http://www.voeglein.de/gi_workshop_2008/Exercise%201.doc Jan Vogel's homepage] and in the tar archive referenced above, in the doc/ directory.
 
For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml#FORMAT AGP] files (a golden path = maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises.
 
Another reason for not loading a whole chromosome into a single MySQL row is that the maximum size of mysql packets is ~10MB (depending on your settings). In addition, MySQL's get-substring implementation will always retrieve the whole row which creates slows down retrieval of short sequences. These are reasons why usually 10-200kb sequences are loaded into the databases and these fragments are then linked by .AGP files.  


== Load coordinate systems into tables seq_region and coord_system ==
== Load coordinate systems into tables seq_region and coord_system ==
Line 19: Line 23:
<code>perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp</code>
<code>perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp</code>
* super_contig -> contig mappings  into seq_region and coord_system:
* super_contig -> contig mappings  into seq_region and coord_system:
<code>perl $PS/load_seq_region.pl -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose</code>
  perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version \
  -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose
* See what's going on with:
* See what's going on with:
   select * from seq_region
   select * from seq_region
Line 25: Line 30:
   select * from dna;
   select * from dna;
* contigs (this is the only step that actually loads sequences into the "dna" table):
* contigs (this is the only step that actually loads sequences into the "dna" table):
   perl $PS/load_seq_region.pl -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa </code>
   perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa  
* clones (only this command loads sequences into the "dna" table):
* clones :
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
  perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 \
  -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
* See what's going on with:
* See what's going on with:
   select * from seq_region
   select * from seq_region
Line 59: Line 65:
   select * from assembly
   select * from assembly
<code>
<code>
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
  +-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
  | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
  +-------------------+-------------------+-----------+-----------+-----------+---------+-----+
|                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 |
  |                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 |
|                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 |
  |                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 |
|                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 |
  |                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 |
|                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 |
  |                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 |
|                 4 |                16 |  21662719 |  21672889 |    115471 |  125641 |   1 |
  +-------------------+-------------------+-----------+-----------+-----------+---------+-----+
|                 5 |                15 |  81038621 |  81119472 |         1 |   80852 |  -1 |
|                 6 |                18 |   3208471 |   3436586 |      2001 |  230116 |   1 |
|                 7 |                15 |  23370008 |  23450859 |         1 |   80852 |  -1 |
|                 8 |                14 |  81573483 |  81742068 |     41261 |  209846 |   1 |
|                 9 |                13 |  44047803 |  44249301 |     22953 |  224451 |   1 |
|                10 |                16 |  10375984 |  10489377 |      2001 |  115394 |   1 |
|                10 |                16 |  10489378 |  10499548 |    115471 |  125641 |   1 |
|                11 |                17 |  35208736 |  35454938 |      2001 |  248203 |   1 |
|                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
</code>
</code>
The original documentation writes about this table: ''The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).''
The original documentation writes about this table: ''The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).''
Line 93: Line 89:


== Check the sequences in the database ==  
== Check the sequences in the database ==  
Output all sequences that are part of a coordinate system to fasta files in a given directory:
   perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp
   perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp

Latest revision as of 10:38, 15 September 2010

This example uses some demo files prepared by Jan Vogel and Bronwen Aken. Download and extract them from File:EnsemblWorkshopFiles.tar.gz, then run your tcsh and do "source set_perl_path.sh".

The example commands on this page are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on Jan Vogel's homepage and in the tar archive referenced above, in the doc/ directory.

For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their AGP files (a golden path = maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises.

Another reason for not loading a whole chromosome into a single MySQL row is that the maximum size of mysql packets is ~10MB (depending on your settings). In addition, MySQL's get-substring implementation will always retrieve the whole row which creates slows down retrieval of short sequences. These are reasons why usually 10-200kb sequences are loaded into the databases and these fragments are then linked by .AGP files.

Load coordinate systems into tables seq_region and coord_system

You need the fasta and AGP files for an assembly. Ensembl supports multiple coordinate systems: Any piece of DNA can be referenced by its chromosomal location (1:1000), or other systems, like its super_contig location (NT_039500:1-1000).

Coordinate systems have a "rank" of importance (the higher the better), and a "version" (so the database contains information for several possible assemblies of the same contigs and annotations can be loaded that are based on several different versions).

Only sequence fragments marked as "toplevel" are shown on the final website (see below). Therefore, you might have never seen an individual contig NT_123453 on the website.

  • The make things easier, let's set a little shortcut:
 export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
  • Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
 mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
  • Load coordinates and actual sequences into the empty core database:
  • chromosome -> super_contig mappings into seq_region and coord_system:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp

  • super_contig -> contig mappings into seq_region and coord_system:
 perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version \
 -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose
  • See what's going on with:
 select * from seq_region
 select * from coord_system 
 select * from dna;
  • contigs (this is the only step that actually loads sequences into the "dna" table):
 perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa 
  • clones :
  perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 \
  -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
  • See what's going on with:
 select * from seq_region
 select * from coord_system 
 select * from dna;
  • Delete version numbers from coord_system table for contig and clone (why is this necessary?):
 select * from coord_system ;
 update coord_system set version=NULL where name ='clone' ;
 update coord_system set version=NULL where name ='contig' ;
 select * from coord_system ;

 +-----------------+-------------+---------+------+--------------------------------+
 | coord_system_id | name        | version | rank | attrib                         |
 +-----------------+-------------+---------+------+--------------------------------+
 |               1 | chromosome  | NCBIM37 |    1 | default_version                |
 |               2 | supercontig | NCBIM37 |    2 | default_version                |
 |               3 | contig      | NULL    |    3 | default_version,sequence_level |
 |               4 | clone       | NULL    |    4 | default_version                |
 +-----------------+-------------+---------+------+--------------------------------+ 

Load assembly information into MySQL tables assembly and meta

  • chromosome to contig mapping:
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp 
  • supercontig to contig (ignore all warnings):
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp 
  • check what's going on:
 select * from assembly

 +-------------------+-------------------+-----------+-----------+-----------+---------+-----+
 | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
 +-------------------+-------------------+-----------+-----------+-----------+---------+-----+
 |                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 |
 |                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 |
 |                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 |
 |                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 |
 +-------------------+-------------------+-----------+-----------+-----------+---------+-----+

The original documentation writes about this table: The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).

  • clone to contig (ignore all warnings)
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp

Complete the assembly database tables =

  • As assemblies can contain references to external databases, we load a list of default references:
 perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt
  • contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, first we need to define the attribute "toplevel", then link the toplevel attribute to the unmapped contigs.
 perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl  $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/set_toplevel.pl $DBSPEC
  • Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc), so we also populate the "unmapped_reasons" table, though you can skip this.
 perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt

Check the sequences in the database

Output all sequences that are part of a coordinate system to fasta files in a given directory:

 perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp