The Ensembl Browser: Difference between revisions
From genomewiki
Revision as of 10:33, 15 September 2010
I am trying to learn how Ensembl is structured. These notes are partially based on a workshop in 2010 at the EBI.
Basic Information
- Everything is in MySQL databases. No flat text files.
- Everything is programmed in Perl (except the UCSC programs?)
- Main documentation start page
- Bert Overduin's homepage has a list of all slides and exercises - very handy!
- Parts of source code:
  - "core": genome database and related tools
  - "pipeline": the job scheduling system + config files
  - "analysis": all genome annotation tools and wrappers
- Subdirectories of source parts:
  - "scripts": command-line tools (mostly Perl)
  - "modules": Perl modules for command-line tools
  - "sql": database schemas for scripts (remember that everything reads/writes to MySQL)
Pipeline / scheduling system
- Job descriptions, input data descriptions and commands are written to MySQL; the cluster writes the results back to MySQL
- Each node will only extract part of the
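- The paper (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC479123/) has a rough general outline
- Basic schema is in ensembl-pipeline/sql/table.sql with some documentation
- The most useful documentation is in the CVS pipeline-docs (http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/?root=ensembl)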
- The genebuild step predicts genes
- The xref step connects predicted genes to external identifiers
- The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins
- The biomart step de-normalizes all databases for faster access (all older biomart versions are accessible via the archived old Ensembl versions)
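The database-backed scheduling idea described above (jobs are written to a database, workers pick them up and write results back) can be illustrated with a small sketch. This is not the Ensembl pipeline schema: the table `job`, its columns, the status values and the command name are all hypothetical, and an in-memory SQLite database stands in for the MySQL server.

```python
import sqlite3

# Stand-in for the MySQL pipeline database. Table and column names are
# hypothetical and do NOT reflect ensembl-pipeline/sql/table.sql.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job ("
    " job_id INTEGER PRIMARY KEY,"
    " input_id TEXT,"                  # which piece of input data to process
    " command TEXT,"                   # what to run on it
    " status TEXT DEFAULT 'CREATED',"
    " result TEXT)"
)


def submit(conn, input_id, command):
    """Write one job description to the database and return its id."""
    cur = conn.execute(
        "INSERT INTO job (input_id, command) VALUES (?, ?)",
        (input_id, command),
    )
    return cur.lastrowid


def run_next(conn, worker):
    """Claim the oldest pending job, run it, and write the result back."""
    row = conn.execute(
        "SELECT job_id, input_id, command FROM job "
        "WHERE status = 'CREATED' ORDER BY job_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing left to do
    job_id, input_id, command = row
    conn.execute("UPDATE job SET status = 'RUNNING' WHERE job_id = ?", (job_id,))
    result = worker(input_id, command)
    conn.execute(
        "UPDATE job SET status = 'DONE', result = ? WHERE job_id = ?",
        (result, job_id),
    )
    return job_id


def toy_worker(input_id, command):
    # Placeholder for a real analysis program running on a cluster node.
    return f"{command} finished on {input_id}"


submit(conn, "contig_1", "repeat_masking")
submit(conn, "contig_2", "repeat_masking")
while run_next(conn, toy_worker) is not None:
    pass

rows = conn.execute(
    "SELECT input_id, status, result FROM job ORDER BY job_id"
).fetchall()
for row in rows:
    print(row)
```

A real cluster would have many workers claiming jobs concurrently, which needs locking or an atomic claim step; the sketch runs a single worker and skips that.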
Genome data storage
- Basic schema is in ensembl/sql/table.sql, with quite a bit of documentation of the tables
- Can be accessed via the Perl API (slow), via biomart.org (~table browser, fast and convenient) or via direct SQL queries
- Database schema documentation
- The database schema is very complex. Because of self-referencing tables, whole-genome queries are not possible at reasonable speed without biomart
- An update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older archived versions from current ones.
- Genes are not re-predicted for every release, only when new data is added to the gene build. The starting month of the last gene build update is stored in genome_db.genebuild (this is not the month in which the genebuild ended, so I don't see how you can tell whether genes changed)
- The current version can be found with:
select * from meta where meta_key in ("schema_version", "patch");
- Usually, each species has its own database, like in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'
- The Web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful
- The database structure is highly normalized. While this is nice from a software engineering perspective, you cannot do large-scale requests. E.g. downloading all homologs between two genomes involves queries on self-referencing tables, which take ages to resolve and will time out if run on their server. Use biomart for these types of requests.
- There are still a lot of older functions lingering in the source code. If a function returns null although it shouldn't, have a look at the source code: such functions have often been replaced by others. The ensembl-dev mailing list is a good way to get more information.
- Ensembl minimum install
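As a rough illustration of the version query mentioned in the list above, the sketch below runs it from Python. It assumes the meta table is a key/value table; the column name `meta_value` and all the rows are assumptions made up for the demonstration, and an in-memory SQLite database stands in for a real Ensembl MySQL server.

```python
import sqlite3

# Toy stand-in for a species database. Only meta_key appears in the notes
# above; meta_value and the rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (meta_key TEXT, meta_value TEXT)")
conn.executemany(
    "INSERT INTO meta VALUES (?, ?)",
    [
        ("schema_version", "56"),
        ("patch", "example patch note"),
        ("some.other_key", "not part of the version info"),
    ],
)


def schema_info(conn):
    """Return the schema_version and patch entries grouped by key."""
    rows = conn.execute(
        "SELECT meta_key, meta_value FROM meta "
        "WHERE meta_key IN ('schema_version', 'patch')"
    ).fetchall()
    info = {}
    for key, value in rows:
        # A key such as 'patch' may occur several times, so keep a list.
        info.setdefault(key, []).append(value)
    return info


info = schema_info(conn)
print(info)
```

Against a real server, the same SELECT statement would be sent through a MySQL client library instead of sqlite3.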
Databases
All versions of the genomes are on the same server. Some ideas to help you find your way:
- Database names follow the schema <species>_<databaseType>_<releaseNumber>_<assemblyNumber><ChangesSinceLastAnnotation> (in the examples below, the change letter follows the assembly number directly)
  - the assembly number is the assembly version number from NCBI in the case of human
  - the release number is updated every 2 months
  - e.g. Homo_sapiens_core_59_37d
  - e.g. Homo_sapiens_core_58_37d
  - e.g. Homo_sapiens_core_57_37d
  - The last letter is the most important part: the final "d" is the same in all three releases, so there were no changes at all to the human annotations between them!
- ensembl_compara includes homologies between proteins and genomes
- ensembl_go_version: Not used anymore? Was used to store gene ontology links.
- ensembl_website_version: Ensembl includes some sort of content management system. This database includes help articles, bugs, news, the list of species on the front page etc. (This database looks somewhat similar to hgcentral)
- ensembl_ancestral_version ??
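The naming scheme for species databases can be checked mechanically. The following is a minimal sketch that splits names of the form shown in the examples above (such as Homo_sapiens_core_59_37d) into their parts; the regular expression and the field names are my own assumptions derived from those examples, not an official Ensembl rule.

```python
import re
from typing import NamedTuple


class DbName(NamedTuple):
    species: str
    db_type: str
    release: int
    assembly: int
    change_letter: str


# Parsed from the right: the species part may itself contain underscores.
# Pattern inferred from the example names only (an assumption).
DB_NAME_RE = re.compile(
    r"^(?P<species>.+)_(?P<db_type>[a-z]+)_(?P<release>\d+)"
    r"_(?P<assembly>\d+)(?P<letter>[a-z]*)$"
)


def parse_db_name(name: str) -> DbName:
    """Split a species database name into its components."""
    match = DB_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised database name: {name!r}")
    return DbName(
        species=match.group("species"),
        db_type=match.group("db_type"),
        release=int(match.group("release")),
        assembly=int(match.group("assembly")),
        change_letter=match.group("letter"),
    )


names = [
    "Homo_sapiens_core_59_37d",
    "Homo_sapiens_core_58_37d",
    "Homo_sapiens_core_57_37d",
]
parsed = [parse_db_name(n) for n in names]

# If every release carries the same assembly number and change letter,
# the annotation did not change between those releases.
versions = {(p.assembly, p.change_letter) for p in parsed}
print(parsed[0])
print("annotation unchanged:", len(versions) == 1)
```

Databases that do not follow this scheme, such as ensembl_compara, are rejected with a ValueError rather than guessed at.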
Species database
- Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. The coordinate systems are set up in the table 'coord_system'
- The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table dnac for compressed sequences, but it is empty.
- Genes are linked to synonyms/names via xref-tables.
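To make the idea of coordinate systems concrete, here is a toy sketch of how a contig location can be translated into a chromosome location. It does not use the Ensembl tables or API: the `Placement` record, the contig name and all numbers are hypothetical, coordinates are assumed to be 1-based and inclusive, and only contigs placed in forward orientation are handled.

```python
from typing import Dict, NamedTuple, Tuple


class Placement(NamedTuple):
    """Where a contig sits on a chromosome (hypothetical toy record)."""
    chrom: str
    chrom_start: int  # 1-based chromosome position of the contig's first base
    length: int       # contig length in bases


def contig_to_chrom(
    placements: Dict[str, Placement], contig: str, start: int, end: int
) -> Tuple[str, int, int]:
    """Map a 1-based inclusive contig interval to chromosome coordinates."""
    placement = placements[contig]
    if not (1 <= start <= end <= placement.length):
        raise ValueError("interval lies outside the contig")
    # Forward orientation only: a constant offset shifts both ends.
    offset = placement.chrom_start - 1
    return placement.chrom, start + offset, end + offset


# Made-up example data: one contig placed at position 1001 of chromosome 1.
placements = {"contig_A": Placement(chrom="1", chrom_start=1001, length=500)}

location = contig_to_chrom(placements, "contig_A", 10, 20)
print(location)
```

A real mapping also has to deal with reverse-oriented contigs and with several levels of assembly, which this sketch leaves out.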
Documentation
- Most documentation is not accessible from the Ensembl homepage. The most useful documentation is available via "ensembl-doc" / "pipeline_docs" (http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/?root=ensembl). The file overview.txt gives a very good introduction.