Blastz

Blastz is a nucleotide local alignment program developed by Webb Miller's group at Penn State (PSU) (http://www.bx.psu.edu/miller_lab/). It takes FASTA or nib sequences as input and produces output in its own format, lav. Notable algorithmic improvements over traditional BLAST:

  • seeds matches using 12-of-19 tiles, i.e. spaced seeds in which 12 designated positions of a 19-base window must match (see Ma et al. 2002)
  • scores pairs of bases using a substitution matrix (as opposed to the old +1 for match, -1 for mismatch) (see Chiaromonte et al. 2002)
  • penalizes gaps using a large gap-opening penalty and small gap-extension penalty, to reduce the over-penalization of longer gaps
  • exposes many search parameters as command-line arguments (described in a .pdf file included in the source download)
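
The seeding and scoring ideas in the list above can be sketched in a few lines of Python. This is a minimal illustration, not blastz's implementation. The seed pattern, the substitution scores (HOXD70-style values in the spirit of Chiaromonte et al. 2002), and the gap penalties (400 to open, 30 per gapped base) are assumptions chosen for the example, and all function names are hypothetical.

```python
# Assumed 12-of-19 spaced-seed pattern: '1' = base must match, '0' = ignored.
SEED_12_OF_19 = "1110100110010101111"

# Assumed substitution scores (HOXD70-style); symmetric, keyed by sorted pair.
_SCORES = {
    "AA": 91, "AC": -114, "AG": -31, "AT": -123,
    "CC": 100, "CG": -125, "CT": -31,
    "GG": 100, "GT": -114,
    "TT": 91,
}
GAP_OPEN = 400    # assumed one-time cost of starting a gap
GAP_EXTEND = 30   # assumed cost per gapped base


def seed_hit(a, b, pattern=SEED_12_OF_19):
    """True if two windows agree at every '1' position of the pattern."""
    if len(a) != len(pattern) or len(b) != len(pattern):
        raise ValueError("windows must be as long as the seed pattern")
    return all(x == y for x, y, p in zip(a, b, pattern) if p == "1")


def substitution_score(x, y):
    """Look up the score for aligning base x with base y (case-insensitive)."""
    return _SCORES["".join(sorted((x.upper(), y.upper())))]


def gap_penalty(length):
    """Affine gap cost: a large opening cost plus a small per-base cost."""
    return GAP_OPEN + GAP_EXTEND * length


def alignment_score(row_a, row_b):
    """Score two gapped alignment rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    open_gap_in = None  # row that currently has an open gap: "a", "b", or None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != open_gap_in:
                score -= GAP_OPEN  # charged once per run of gap columns
                open_gap_in = row
            score -= GAP_EXTEND
        else:
            score += substitution_score(x, y)
            open_gap_in = None
    return score
```

Under these assumed penalties, one gap of 10 bases costs 700, while ten separate single-base gaps cost 4300, which is the sense in which a large opening penalty with a small extension penalty avoids over-penalizing long gaps.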

Currently, the browser is built with lastz, an improved version of blastz.

History (at UCSC)

Scott Schwartz, collaborating with Jim Kent, developed blastz and wrapper scripts that improved its performance to the point where an alignment of the whole genomes of human and mouse completed in a reasonable amount of time on the kilokluster. That run, in 2002, aligned about 40% of the non-Un/random genome sequence. This work (both blastz and the human-mouse results) was published in 2003 (Schwartz et al., "Human-mouse alignments with BLASTZ", Genome Research).

In late 2002, browser staff developers took over the task of running blastz on new genomes, using the scripts developed by Scott Schwartz. In 2003, Jim developed the chaining and netting process (see Chains_Nets) for enhancing and filtering the local alignments produced by blastz. In 2005, the blastz-chain-net procedure and Scott's wrapper scripts were rolled into doBlastzChainNet.pl, which has been used in most runs since then.

(Should discuss origin of subst. matrices, blastz v6 vs v7.)

Repeats and Masking

There are several distinct treatments of repeats in the context of blastz:

  1. Soft-masking: The bases annotated by RepeatMasker and TRF (with period <= 12) are put in lower-case (all other sequence is upper-case). blastz does not begin an alignment in lower-case sequence, but can extend an existing alignment through the lower-case sequence. This helps it to correctly align "ancient repeats" (shared by both species) interspersed in an aligning region, while not getting swamped by repetitive alignments.
  2. Dynamic masking: This has nothing to do with RepeatMasker/TRF but is worth a mention. Blastz has an M parameter -- I believe we have set it to 50 in most instances, but have seen it set to 0 occasionally. If blastz finds more than M matches to a part of the sequence, it will start to ignore that part of the sequence. That helps to avoid getting swamped by unannotated repeats.
  3. Abridging lineage-specific repeats: For some pairs of species (I believe it's only a subset of mammal-mammal pairs), RepeatMasker can annotate some repeats as lineage-specific, e.g. a repeat in mouse that was inserted after the most recent common ancestor with human. Those lineage-specific repeats can interrupt regions that otherwise align well, which makes those regions harder to detect and align. So, when aligning a pair of species for which the extra annotation is available, we remove the lineage-specific-annotated portions of repeats before invoking blastz. Then, the resulting alignments from blastz are adjusted so that their coordinates reflect the original unabridged sequence.
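
The three treatments above can be illustrated with a small Python sketch. It is a simplified model rather than blastz's or the wrapper scripts' actual code: all names are hypothetical, the rule that a seed window must be entirely upper-case is an assumed simplification of "does not begin an alignment in lower-case sequence", and treating M = 0 as "dynamic masking off" is likewise an assumption.

```python
from bisect import bisect_right


def can_seed(seq, start, seed_len):
    """Soft-masking: allow a seed only if its whole window is upper-case.
    (Assumed simplification; extension through lower-case is not modeled.)"""
    window = seq[start:start + seed_len]
    return len(window) == seed_len and window.isupper()


class DynamicMasker:
    """Dynamic masking: ignore positions covered by more than M alignments."""

    def __init__(self, length, m):
        self.m = m
        self.hits = [0] * length  # number of alignments covering each base

    def record_alignment(self, start, end):
        """Count one alignment covering the half-open interval [start, end)."""
        for i in range(start, end):
            self.hits[i] += 1

    def is_masked(self, pos):
        # Assumption: m == 0 means dynamic masking is switched off.
        return self.m > 0 and self.hits[pos] > self.m


def abridge(seq, repeats):
    """Remove half-open [start, end) repeat intervals from seq.

    Returns the abridged sequence plus two parallel lists used to map
    coordinates back: the abridged position of each cut, and the total
    number of bases removed up to and including that cut.
    """
    kept, cut_positions, cum_removed = [], [], []
    prev_end = 0
    removed = 0
    for start, end in sorted(repeats):
        if start < prev_end:
            raise ValueError("repeat intervals must not overlap")
        kept.append(seq[prev_end:start])
        cut_positions.append(start - removed)
        removed += end - start
        cum_removed.append(removed)
        prev_end = end
    kept.append(seq[prev_end:])
    return "".join(kept), cut_positions, cum_removed


def lift_to_original(pos, cut_positions, cum_removed):
    """Map a 0-based abridged coordinate back to the unabridged sequence."""
    i = bisect_right(cut_positions, pos)
    return pos + (cum_removed[i - 1] if i else 0)
```

For example, abridging `"ACGTnnnnACGTnnAC"` with repeats `[(4, 8), (12, 14)]` yields `"ACGTACGTAC"`, and `lift_to_original` maps each abridged position back to the matching base in the original string, which is the kind of adjustment applied to alignment coordinates after blastz has run on abridged sequence.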