Bin indexing system
Introduction
The binning index system used in the genome browser is a mechanism used in concert with MySQL indexes to speed up selection of MySQL rows for genome coordinate overlapping items. This type of search is sometimes called a range request. The system as first used in the genome browser is described in: "The Human Genome Browser at UCSC" Kent, et. al. Genome Research 2002.12:996-1006, see Figure 7, quote:
We settled on a binning scheme suggested by Lincoln Stein and Richard Durbin. A simple version of this scheme is shown in Figure7. In the browser itself, we use five different sizes of bins: 128 kb, 1 Mb, 8 Mb, 64 Mb, and 512 Mb.
That initial implementation has since been enhanced by an additional level of bins to allow items of size up to 4 Gb (actually only to 2Gb given integer size limits). The new and the old system coexist together. Given an item with a chromEnd coordinate of less than or equal to 512 Mb, a bin number in the old system will be used. An item with a chromEnd coordinate greater than 512 Mb, a bin number in the new system will be used.
Since all of these bins are in sizes of powers of two, the calculation of the bin number is a simple matter of bit shifting of the chromStart and chromEnd coordinates. The C code for the bin calculation can be seen in the kent source tree in src/lib/binRange.c.
Initial implementation
Used when chromEnd is less than or equal to 536,870,912 = 229
bin numbers | bin | |||
---|---|---|---|---|
level | #bins | start | end | size |
0 | 1 | 0 | 0 | 512 Mb |
1 | 8 | 1 | 8 | 64 Mb |
2 | 64 | 9 | 72 | 8 Mb |
3 | 512 | 73 | 584 | 1 Mb |
4 | 4096 | 585 | 4680 | 128 kb |
Extended implementation
Used when chromEnd is greater than 536,870,912 = 229 and less than 2,147,483,647 = 231 - 1
bin numbers | bin | |||
---|---|---|---|---|
level | #bins | start | end | size |
0 | 1 | 4691 | 4691 | 2 Gb |
1 | 8 | 4683 | 4685 | 512 Mb |
2 | 64 | 4698 | 4721 | 64 Mb |
3 | 512 | 4818 | 5009 | 8 Mb |
4 | 4,096 | 5778 | 7313 | 1 Mb |
5 | 32,768 | 13458 | 25745 | 128 kb |
Initial implementation C code
/* This file is copyright 2002 Jim Kent, but license is hereby * granted for all use - public, private or commercial. */ static int binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0}; #define _binFirstShift 17 /* How much to shift to get to finest bin. */ #define _binNextShift 3 /* How much to shift to get to next larger bin. static int binFromRangeStandard(int start, int end) /* Given start,end in chromosome coordinates assign it * a bin. There's a bin for each 128k segment, for each * 1M segment, for each 8M segment, for each 64M segment, * and for each chromosome (which is assumed to be less than * 512M.) A range goes into the smallest bin it will fit in. */ { int startBin = start, endBin = end-1, i; startBin >>= _binFirstShift; endBin >>= _binFirstShift; for (i=0; i<ArraySize(binOffsets); ++i) { if (startBin == endBin) return binOffsets[i] + startBin; startBin >>= _binNextShift; endBin >>= _binNextShift; } errAbort("start %d, end %d out of range in findBin (max is 512M)", start, end); return 0; }
Extended implementation C code
/* This file is copyright 2002 Jim Kent, but license is hereby * granted for all use - public, private or commercial. */ /* add one new level to get coverage past chrom sizes of 512 Mb * effective limit is now the size of an integer since chrom start * and end coordinates are always being used in int's == 2Gb-1 */ static int binOffsetsExtended[] = {4096+512+64+8+1, 512+64+8+1, 64+8+1, 8+1, 1, 0}; static int binFromRangeExtended(int start, int end) /* Given start,end in chromosome coordinates assign it * a bin. There's a bin for each 128k segment, for each * 1M segment, for each 8M segment, for each 64M segment, * for each 512M segment, and one top level bin for 4Gb. * Note, since start and end are int's, the practical limit * is up to 2Gb-1, and thus, only four result bins on the second * level. * A range goes into the smallest bin it will fit in. */ { int startBin = start, endBin = end-1, i; startBin >>= _binFirstShift; endBin >>= _binFirstShift; for (i=0; i<ArraySize(binOffsetsExtended); ++i) { if (startBin == endBin) return _binOffsetOldToExtended + binOffsetsExtended[i] + startBin; startBin >>= _binNextShift; endBin >>= _binNextShift; } errAbort("start %d, end %d out of range in findBin (max is 2Gb)", start, end); return 0; }
Unix Command Line, Perl, Python and Ruby
A generic Perl script for any Unix command line, by Heng Li:
http://lh3lh3.users.sourceforge.net/ucsc-mysql.shtml
For Python, see:
https://github.com/brentp/cruzdb/blob/master/cruzdb/__init__.py#L489
For Ruby, see: