Ensembl data load: Difference between revisions

Latest revision as of 15:59, 14 September 2010

The Ensembl system integrates a job scheduler with the data loaders. You schedule a RepeatMasker run with all its parameters stored in MySQL tables, and later find the results of that run in your MySQL database. The complete job description is called an "analysis". It consists of individual steps which are connected by "rules".

The first step is always pulling in the sequences themselves (the "dummy" step). The next steps process these sequences. "Rules" are added at the end to say which step depends on which other step. This whole system is called "ensembl-pipeline". There is a newer system written by the ensembl-compara people called "ensembl-hive", which we are not covering here. The hive is more straightforward to use in some respects; see the CVS module ensembl-hive and its doc directory.

The advantage of this system is that analyses are defined once and can then be applied to any genome. The drawback is that the system is quite difficult to grasp at first.

You need to download / extract the example files from File:EnsemblWorkshopFiles.tar.gz for the following steps.

Set up sequences

To make things easier, let's set a little shortcut (this is csh/tcsh syntax; in bash, use export DBSPEC="..." instead):

setenv DBSPEC "-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
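
The same connection settings are handy when you want to look at the tables by hand with the mysql command-line client. The following is a minimal sh/bash sketch, not part of ensembl-pipeline: the helper name dbspec_to_mysql_args is made up for illustration, and it assumes DBSPEC contains only flag/value pairs as in the shortcut above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ensembl-pipeline): map the Ensembl-style
# connection flags in DBSPEC onto options of the mysql command-line client.
DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"

dbspec_to_mysql_args() (
    # Runs in a subshell, so the variables below do not leak to the caller.
    host="" user="" port="" name="" pass=""
    set -f          # no filename expansion while splitting
    set -- $1       # deliberate word splitting into flag/value pairs
    while [ $# -ge 2 ]; do
        case "$1" in
            -dbhost) host=$2 ;;
            -dbuser) user=$2 ;;
            -dbport) port=$2 ;;
            -dbname) name=$2 ;;
            -dbpass) pass=$2 ;;
        esac
        shift 2
    done
    # mysql expects the password glued to -p and the database name last.
    echo "-h $host -u $user -P $port -p$pass $name"
)

dbspec_to_mysql_args "$DBSPEC"
```

With this in place, something like mysql $(dbspec_to_mysql_args "$DBSPEC") -e 'SELECT logic_name FROM analysis' should open the workshop database, provided the mysql client is installed.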

Create a "dummy analysis file" which will simply select the sequences to analyse (here: contigs), e.g. create a file submit_ana.conf:

[SubmitContig]
module=Dummy
input_id_type=CONTIG

Load the "dummy analysis" into the database

$HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file submit_ana.conf

Add the "input_ids" step which adds the sequences to the job description:

perl $HOME/cvs_checkout/ensembl-pipeline/scripts/make_input_ids $DBSPEC -logic_name SubmitContig -coord_system contig

Set up repeatmasking of sequences

Define the real analysis, e.g. in a file repeatmask_ana.conf:

[RepeatMask]
db=repbase
db_version=0129
db_file=repbase
program=RepeatMask
program_version=3.1.8
program_file=/path/to/repmasker/RepeatMask
parameters=-nolow -species mouse -s
module=RepeatMask
gff_source=RepeatMask
gff_feature=repeat
input_id_type=CONTIG
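
Before loading a config file it can help to check that no section is missing a key. The sketch below is a hypothetical pre-flight check, not an Ensembl script. It assumes, based only on the two examples on this page, that every section should have a module= and an input_id_type= line; analysis_setup.pl itself may enforce different rules.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not an Ensembl script): report each [section]
# of an analysis config and whether it has module= and input_id_type= lines.
check_analysis_conf() {
    awk '
        function report() {
            if (section != "")
                print section ": " ((has_module && has_type) ? "ok" : "incomplete")
        }
        /^\[.*\]$/        { report(); section = substr($0, 2, length($0) - 2)
                            has_module = 0; has_type = 0; next }
        /^module=/        { has_module = 1 }
        /^input_id_type=/ { has_type = 1 }
        END               { report() }
    ' "$@"
}

# Demo on the dummy analysis from this page, fed on standard input.
check_analysis_conf <<'EOF'
[SubmitContig]
module=Dummy
input_id_type=CONTIG
EOF
```

Called with a file name, e.g. check_analysis_conf repeatmask_ana.conf, it prints one line per section.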

Load the analysis into the MySQL database:

$HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file repeatmask_ana.conf

See what happened:

SELECT * from analysis\G
*************************** 1. row ***************************
   analysis_id: 1
       created: 2010-09-13 16:50:16
    logic_name: SubmitContig
            db: NULL
    db_version: NULL
       db_file: NULL
       program: NULL
program_version: NULL
  program_file: NULL
    parameters: NULL
        module: Dummy
module_version: NULL
    gff_source: NULL
   gff_feature: NULL
*************************** 2. row ***************************
   analysis_id: 2
       created: 2010-09-13 16:14:11
    logic_name: RepeatMask
            db: repbase
    db_version: 0129
       db_file: repbase
       program: RepeatMask
program_version: 3.1.8
  program_file: /path/to/repmasker/RepeatMask
    parameters: -nolow -species mouse -s
        module: RepeatMask
module_version: NULL
    gff_source: RepeatMask
   gff_feature: repeat

Add a rule which says that RepeatMask requires the contig sequences:

perl $HOME/cvs_checkout/ensembl-pipeline/scripts/RuleHandler.pl $DBSPEC \
-insert -goal RepeatMask \
-condition SubmitContig

Check what has changed:

select ia.input_id,a.logic_name from input_id_analysis ia, analysis a where ia.analysis_id = a.analysis_id ;
+---------------------------------------+--------------+
| input_id                              | logic_name   |
+---------------------------------------+--------------+
| contig:NCBIM37:AC087062.25:1:224451:1 | SubmitContig | 
| contig:NCBIM37:AC138620.4:1:209846:1  | SubmitContig | 
| contig:NCBIM37:AC153919.8:1:264561:1  | SubmitContig | 
| contig:NCBIM37:AL589742.21:1:125641:1 | SubmitContig |
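
Each input_id in this table is an Ensembl slice name: six colon-separated fields giving coordinate system, version, sequence region name, start, end and strand. As a small illustration, here is a shell sketch that splits such an id; the function name parse_input_id is made up for this example.

```shell
#!/bin/sh
# Split an Ensembl slice-style input_id into its six colon-separated fields.
parse_input_id() (
    # The body runs in a subshell, so the variables stay local to the call.
    IFS=: read -r cs version region start end strand <<EOF
$1
EOF
    echo "coord_system=$cs version=$version seq_region=$region start=$start end=$end strand=$strand"
)

parse_input_id "contig:NCBIM37:AC087062.25:1:224451:1"
```

An id with an empty version field, such as contig::AC087062.25:1:224451:1, splits the same way and simply yields an empty version.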

Run pipeline

Have a look at the central pipeline config file:

$HOME/workshop/genebuild/configs/pipeline_config/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm

Add the config file to your pipeline system:

export PERL5LIB=$HOME/workshop/genebuild/configs/pipeline_config/modules:${PERL5LIB}
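
To confirm that Perl will pick up this copy of BatchQueue.pm rather than another one, you can walk PERL5LIB in order, since Perl searches these directories first to last. A minimal sketch, with find_in_perl5lib being a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the first file on PERL5LIB that matches the given
# relative module path, mirroring the order in which Perl searches directories.
find_in_perl5lib() (
    rel=$1
    set -f      # no filename expansion while splitting
    IFS=:       # split PERL5LIB on colons (subshell, so IFS is restored after)
    for dir in ${PERL5LIB:-}; do
        if [ -f "$dir/$rel" ]; then
            echo "$dir/$rel"
            exit 0
        fi
    done
    echo "not found on PERL5LIB: $rel" >&2
    exit 1
)

find_in_perl5lib Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm \
    || echo "BatchQueue.pm is not on PERL5LIB yet"
```

If the printed path is not the one under $HOME/workshop/genebuild/configs/pipeline_config/modules, another directory earlier in PERL5LIB is shadowing it.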

Test the pipeline

perl $HOME/cvs_checkout/ensembl-analysis/scripts/test_RunnableDB $DBSPEC \
-logic_name RepeatMask \
-input_id contig::AC087062.25:1:224451:1 \
-dbpass workshop -verbose

Run the pipeline

perl $HOME/cvs_checkout/ensembl-pipeline/scripts/rulemanager.pl $DBSPEC \
-logic_name RepeatMask

You can use the script "monitor $DBSPEC -current -finishedp" to check how much is already done.

Dump out the genome again to see that the repeat-masked sequences are now in lowercase:

perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC \
-softmask -mask_repeat RepeatMask \
-coord_system_name chromosome \
-output_dir $HOME/workshop/genebuild/output/softmasked_seq
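
A quick way to see whether the masking had any effect is to count the lowercase bases in the dumped sequence. The sketch below assumes the dump is in FASTA format; the function name softmask_fraction is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: count lowercase (soft-masked) bases in FASTA input.
softmask_fraction() {
    # Skip header lines, then count all letters and the lowercase ones.
    awk '!/^>/ { total += length($0); lower += gsub(/[acgtn]/, "") }
         END   { if (total > 0)
                     printf "%d of %d bases soft-masked (%.1f%%)\n", lower, total, 100 * lower / total
                 else
                     print "no sequence found" }' "$@"
}

# Demo on a tiny made-up sequence read from standard input.
softmask_fraction <<'EOF'
>demo
ACGTacgtAC
EOF
```

Pass it any FASTA file written to the output directory; a result of 0 soft-masked bases suggests the RepeatMask analysis did not store any features.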

@@ Line 1: / Line 1: @@
-== Load Repeatmasker file ==
+The Ensembl system integrates a job scheduler with the data loaders. You schedule a repeatmasker run with all parameters in mysql tables and find the results of this run later in your mysql databases. The complete job description is called an "analysis". It consists of individual steps which are connected by "rules".
-*  The make things easier, let's set a little shortcut:
- export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
+The first step is always pulling in the sequences themselves ("dummy" step). The next steps will process these sequences. "Rules" are added at the end to say which step is based on which other step. This whole system is called "ensembl-pipeline". There is a new system written by the ensembl-compara people called "ensembl-hive" which we are not covering here. The hive is more straightforward to use in some respect, see the CVS module ensembl-hive and its doc-directory.
+The advantage of this system is that analyses are defined once and can then be applied to any genome, while the system is unfortunately quite difficult to grasp at first.
-* Run repeatmasker on a fasta file:
+You need to download / extract the example files from [[Image:EnsemblWorkshopFiles.tar.gz]] for the following steps.
- RepeatMasker -species mouse -qq -dir <full_path_to_output_directory> $HOME/workshop/genebuild/test_seqs/test_sequence_to_repeatmask.fa
+== Set up sequences ==
+*  The make things easier, let's set a little shortcut:
+ setenv DBSPEC "-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
 * Create a "dummy analysis file" which will simply select the sequences to analyse (here: contigs), e.g. create a file submit_ana.conf:
   [SubmitContig]
   module=Dummy
   input_id_type=CONTIG
-* Load the "dummy analysis"
+* Load the "dummy analysis" into the database
-  $HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file repeatmask_ana.conf
+  $HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file submit_ana.conf
+* Add the "input_ids" step which adds the sequences to the job description:
+  perl $HOME/cvs_checkout/ensembl-pipeline/scripts/make_input_ids $DBSPEC -logic_name SubmitContig -coord_system contig
+== Set up repeatmasking of sequences ==
 * Define the real analysis, e.g. repeatmask_ana.conf
   [RepeatMask]
@@ Line 28: / Line 36: @@
   $HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file repeatmask_ana.conf
 * see what happened:
-  SELECT * from analysis;\G
+  SELECT * from analysis\G
   *************************** 1. row ***************************
      analysis_id: 1
+        created: 2010-09-13 16:50:16
+     logic_name: SubmitContig
+             db: NULL
+     db_version: NULL
+        db_file: NULL
+        program: NULL
+ program_version: NULL
+   program_file: NULL
+     parameters: NULL
+         module: Dummy
+ module_version: NULL
+     gff_source: NULL
+    gff_feature: NULL
+ *************************** 2. row ***************************
+    analysis_id: 2
          created: 2010-09-13 16:14:11
       logic_name: RepeatMask
@@ Line 44: / Line 67: @@
       gff_source: RepeatMask
      gff_feature: repeat
+* Add a rule which says that RepeatMask requires the contig sequences:
+ perl $HOME/cvs_checkout/ensembl-pipeline/scripts/RuleHandler.pl $DBSPEC \
+ -insert -goal RepeatMask \
+ -condition SubmitContig
+* check what has changed:
+ select ia.input_id,a.logic_name from input_id_analysis ia, analysis a where ia.analysis_id = a.analysis_id ;
+ +---------------------------------------+--------------+
+ | input_id                              | logic_name   |
+ +---------------------------------------+--------------+
+ | contig:NCBIM37:AC087062.25:1:224451:1 | SubmitContig |
+ | contig:NCBIM37:AC138620.4:1:209846:1  | SubmitContig |
+ | contig:NCBIM37:AC153919.8:1:264561:1  | SubmitContig |
+ | contig:NCBIM37:AL589742.21:1:125641:1 | SubmitContig |
+== Run pipeline ==
+* Have a look at the central pipeline config file:
+ $HOME/workshop/genebuild/configs/pipeline_config/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm
+* Add the config file to your pipeline system:
+ export PERL5LIB=$HOME/workshop/genebuild/configs/pipeline_config/modules:${PERL5LIB}
+* Test the pipeline
+ perl $HOME/cvs_checkout/ensembl-analysis/scripts/test_RunnableDB $DBSPEC\
+ -logic_name RepeatMask \
+ -input_id contig::AC087062.25:1:224451:1 \
+ -dbpass workshop –verbose
+* Run the pipeline
+ perl $HOME/cvs_checkout/ensembl-pipeline/scripts/rulemanager.pl $DBSPEC\
+ -logic_name RepeatMask \
+* You can use the script "monitor $DBSPEC -current -finishedp" to check how much is already done.
+* dump out the genome again to see how the repeat masked sequences are now in lowercase:
+ perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC\
+ -softmask -mask_repeat RepeatMask\
+ -coord_system_name chromosome \
+ -output_dir $HOME/workshop/genebuild/output/softmasked_seq