Ensembl data load
From genomewiki
Revision as of 09:02, 14 September 2010
To load data into an Ensembl database, you add analysis steps. Each step is similar to a job description in a job scheduling system such as parasol, LSF or SGE. The first step always pulls in the sequences themselves (the "dummy" step). The following step processes these sequences. "Rules" are added at the end to state which step depends on which other step.
Load Repeatmasker file
- To make things easier, set a shortcut for the database connection options:
export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
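The same connection settings are needed again when the tables are inspected with the mysql command-line client. A minimal sketch, assuming the values from the DBSPEC line above; the variable names (DBHOST, MYSQLSPEC and so on) are made up for this example and are not read by the pipeline scripts:

```shell
#!/bin/sh
# Connection settings copied from the DBSPEC line above.
DBHOST=127.0.0.1
DBUSER=ens-training
DBPORT=3306
DBNAME=mouse37_mini_ref
DBPASS=workshop

# Option string for the Ensembl pipeline scripts (same value as above).
DBSPEC="-dbhost $DBHOST -dbuser $DBUSER -dbport $DBPORT -dbname $DBNAME -dbpass $DBPASS"
export DBSPEC

# Matching option string for the mysql client (MYSQLSPEC is a hypothetical name).
# Note that -p takes the password without a space in between.
MYSQLSPEC="-h $DBHOST -P $DBPORT -u $DBUSER -p$DBPASS $DBNAME"

echo "pipeline scripts: $DBSPEC"
echo "mysql client:     mysql $MYSQLSPEC"
```

Keeping the individual values in one place avoids the two option strings drifting apart when the host or database name changes.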
- Run RepeatMasker on a FASTA file:
RepeatMasker -species mouse -qq -dir <full_path_to_output_directory> $HOME/workshop/genebuild/test_seqs/test_sequence_to_repeatmask.fa
- Analysis Step 1: Create a "dummy analysis file" which will simply select the sequences to analyse (here: contigs), e.g. create a file submit_ana.conf:
[SubmitContig]
module=Dummy
input_id_type=CONTIG
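The one-section file from this step can also be written straight from the shell with a here-document. This is only a sketch: CONF_DIR is a hypothetical helper variable for this example, and the file lands in the current directory unless it is set.

```shell
#!/bin/sh
# Directory to write the config file into; defaults to the current directory.
# CONF_DIR is a made-up helper variable, not something the pipeline reads.
CONF_DIR="${CONF_DIR:-.}"

# Quoted 'EOF' keeps the shell from expanding anything inside the file body.
cat > "$CONF_DIR/submit_ana.conf" <<'EOF'
[SubmitContig]
module=Dummy
input_id_type=CONTIG
EOF

# Print the file so the section name and keys can be checked before loading.
cat "$CONF_DIR/submit_ana.conf"
```

The section name in square brackets is what shows up as logic_name in the analysis table, so it is worth checking it before loading the file.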
- Load the "dummy analysis" into the database
$HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file submit_ana.conf
- Analysis Step 2: Define the real analysis, e.g. repeatmask_ana.conf
[RepeatMask]
db=repbase
db_version=0129
db_file=repbase
program=RepeatMask
program_version=3.1.8
program_file=/path/to/repmasker/RepeatMask
parameters=-nolow -species mouse -s
module=RepeatMask
gff_source=RepeatMask
gff_feature=repeat
input_id_type=CONTIG
- Load the analysis into the MySQL database:
$HOME/cvs_checkout/ensembl-pipeline/scripts/analysis_setup.pl $DBSPEC -read -file repeatmask_ana.conf
- See what happened:
SELECT * from analysis\G
*************************** 1. row ***************************
    analysis_id: 1
        created: 2010-09-13 16:50:16
     logic_name: SubmitContig
             db: NULL
     db_version: NULL
        db_file: NULL
        program: NULL
program_version: NULL
   program_file: NULL
     parameters: NULL
         module: Dummy
 module_version: NULL
     gff_source: NULL
    gff_feature: NULL
*************************** 2. row ***************************
    analysis_id: 2
        created: 2010-09-13 16:14:11
     logic_name: RepeatMask
             db: repbase
     db_version: 0129
        db_file: repbase
        program: RepeatMask
program_version: 3.1.8
   program_file: /path/to/repmasker/RepeatMask
     parameters: -nolow -species mouse -s
         module: RepeatMask
 module_version: NULL
     gff_source: RepeatMask
    gff_feature: repeat
- Add a rule which says that RepeatMask requires the contig sequences:
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/RuleHandler.pl $DBSPEC \
  -insert -goal RepeatMask \
  -condition SubmitContig
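A rule is a goal/condition pair: the goal step may only run on input that the condition step has already handled. The sketch below prints the RuleHandler.pl call for a given pair without running it, which can help to check the logic names first. The function name rule_cmd and the variable RULEHANDLER are made up for this example; only the options shown in the command above are used.

```shell
#!/bin/sh
# Path to the script as used in the step above.
RULEHANDLER="$HOME/cvs_checkout/ensembl-pipeline/scripts/RuleHandler.pl"

# Dry-run helper (hypothetical): print the command for one goal/condition
# pair instead of executing it. $DBSPEC is printed literally so the output
# can be pasted into a shell where DBSPEC is already exported.
rule_cmd() {
    goal=$1
    condition=$2
    echo "perl $RULEHANDLER \$DBSPEC -insert -goal $goal -condition $condition"
}

# RepeatMask (goal) depends on SubmitContig (condition).
rule_cmd RepeatMask SubmitContig
```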
- Add the "input_ids" step which adds the sequences to the job description:
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/make_input_ids $DBSPEC -logic_name SubmitContig -coord_system contig -slice 150k
- Check what has changed:
select ia.input_id, a.logic_name
from input_id_analysis ia, analysis a
where ia.analysis_id = a.analysis_id;
+---------------------------------------+--------------+
| input_id                              | logic_name   |
+---------------------------------------+--------------+
| contig:NCBIM37:AC087062.25:1:224451:1 | SubmitContig |
| contig:NCBIM37:AC138620.4:1:209846:1  | SubmitContig |
| contig:NCBIM37:AC153919.8:1:264561:1  | SubmitContig |
| contig:NCBIM37:AL589742.21:1:125641:1 | SubmitContig |
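Each input_id in this table is a colon-separated string. The sketch below splits the first one into its parts. The field names are an assumption based on the usual Ensembl slice naming (coordinate system, assembly version, sequence name, start, end, strand); the page itself does not spell them out.

```shell
#!/bin/sh
# First input_id from the table above.
input_id="contig:NCBIM37:AC087062.25:1:224451:1"

# Use ":" as field separator for this one read only.
# Field names are assumed, see the note above.
IFS=: read -r coord_system version name start end strand <<EOF
$input_id
EOF

# Start and end are inclusive, so the length is end - start + 1.
length=$((end - start + 1))

echo "coord_system=$coord_system version=$version name=$name"
echo "start=$start end=$end strand=$strand length=$length"
```

Since every contig here starts at position 1, the end coordinate is also the contig length.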