Wiggle: Difference between revisions
Revision as of 20:41, 10 April 2006
File formats for .wig and .wib files:
The combination of the .wig and .wib files is used to compress real-valued data into a single byte per value. The corresponding genome position is recorded for each data value. A single data value can apply to a single genome base position or to a span of genome base positions.
The .wig files hold the SQL row data values found in the database.
The .wib files are binary files which are interpreted using the data found in the .wig files.
The wiggle SQL table definition is:
 CREATE TABLE wiggle (
     bin smallint(5) unsigned not null,  # bin scheme for indexing
     chrom varchar(255) not null,        # Human chromosome or FPC contig
     chromStart int unsigned not null,   # Start position in chromosome
     chromEnd int unsigned not null,     # End position in chromosome
     name varchar(255) not null,         # Name of item
     span int unsigned not null,         # each value spans this many bases
     count int unsigned not null,        # number of values in this block
     offset int unsigned not null,       # offset in File to fetch data
     file varchar(255) not null,         # path name to .wib data file
     lowerLimit double not null,         # lowest data value in this block
     dataRange double not null,          # lowerLimit + dataRange = upperLimit
     validCount int unsigned not null,   # number of valid data values in this block
     sumData double not null,            # sum of the data points, for average and stddev calculation
     sumSquares double not null,         # sum of data points squared, used for stddev calculation
     INDEX(chrom(8),bin)
 );
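The validCount, sumData and sumSquares columns allow a block to be summarized without reading the .wib file. The following is a minimal sketch in C of how the average and variance could be derived from them. The function names wigBlockMean and wigBlockVariance are illustrative only, and the sample (n - 1) form of the variance is an assumption; the standard deviation is the square root of the returned variance.

```c
/* Mean of the valid data values in one block, from the row's summary columns. */
double wigBlockMean(unsigned int validCount, double sumData)
{
    if (validCount == 0)
        return 0.0;     /* no valid data: 0.0 is returned by convention in this sketch */
    return sumData / validCount;
}

/* Sample variance of the valid data values in one block.
 * The (n - 1) denominator is an assumption of this sketch. */
double wigBlockVariance(unsigned int validCount, double sumData, double sumSquares)
{
    double var;

    if (validCount < 2)
        return 0.0;     /* variance is undefined for fewer than two values */
    var = sumSquares - (sumData * sumData) / validCount;
    return var / (validCount - 1);
}
```

For example, a block holding the values 1, 2, 3 and 4 has validCount 4, sumData 10 and sumSquares 30, which gives a mean of 2.5 and a variance of 5/3.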
The .wib data bytes are interpreted via the SQL row entries in the .wig files. A data binning scheme is used to encode the real-valued data. To de-code the .wib data bytes:
- open the file specified in the "file" pathname
- seek to position "offset"
- read a block of "count" bytes into array: unsigned char data[]
- The first data byte corresponds to genome position "chromStart"
- The next data byte genome position is: "chromStart" + "span"
- data byte value of 128 indicates "NO DATA" at this position
- data byte values 0 to 127 are expanded to double types via:
 unsigned int chromPosition = chromStart;
 for (i = 0; i < count; ++i, chromPosition += span)
 {
     if ( data[i] < 128 )
     {
         double value = lowerLimit+(dataRange*((double)data[i]/127.0));
         printf ("%u\t%g\n", chromPosition, value);
     }
 }
- data byte values in the range [129 : 255] are reserved for future use.
- A single data value is interpreted to apply to genome positions: chrom:chromPosition-(chromPosition + span)
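The steps above can be put together into one routine that opens the .wib file, seeks to the block and prints the decoded values. This is a sketch only: the names wibDecodeByte and wibPrintBlock are hypothetical, and the arguments are assumed to come straight from one row of the wiggle table.

```c
#include <stdio.h>

#define WIG_MAX_BIN 127   /* highest byte value that carries data */

/* Expand one .wib byte to a double. Returns 1 and stores the result in *value
 * when the byte carries data (0..127); returns 0 for "NO DATA" (128) and for
 * the reserved values (129..255). */
int wibDecodeByte(unsigned char b, double lowerLimit, double dataRange, double *value)
{
    if (b > WIG_MAX_BIN)
        return 0;
    *value = lowerLimit + dataRange * ((double)b / 127.0);
    return 1;
}

/* Read "count" bytes starting at "offset" in the .wib file and print one
 * position/value pair per valid byte. Returns the number of valid values
 * printed, or -1 on an I/O error. */
long wibPrintBlock(const char *file, unsigned int offset, unsigned int count,
                   unsigned int chromStart, unsigned int span,
                   double lowerLimit, double dataRange)
{
    FILE *f = fopen(file, "rb");
    unsigned int chromPosition = chromStart;
    unsigned int i;
    long valid = 0;

    if (f == NULL)
        return -1;
    if (fseek(f, (long)offset, SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }
    for (i = 0; i < count; ++i, chromPosition += span) {
        int c = fgetc(f);
        double value;

        if (c == EOF) {         /* block is shorter than "count" bytes */
            fclose(f);
            return -1;
        }
        if (wibDecodeByte((unsigned char)c, lowerLimit, dataRange, &value)) {
            printf("%u\t%g\n", chromPosition, value);
            ++valid;
        }
    }
    fclose(f);
    return valid;
}
```

Keeping the per-byte decoding in its own function makes it easy to reuse when the bytes are already in memory.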
Known Issues:
This compression algorithm reduces the precision of the input data. For any particular block of data specified by a row from the .wig file, the data resolution is (dataRange / 127).
When a single block of data contains positive and negative values, the value of zero (0.0) is not exact. Very small positive or negative data values placed into the same bin as the value zero may be de-coded into values of the opposite sign, and input data values of zero will not be de-coded to 0.0.
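Both precision issues can be reproduced with a small round trip. The encoder below is hypothetical: it assumes values are rounded to the nearest of the 128 bins, which may not match the actual encoder. The decoder uses the formula given in the format description.

```c
/* Hypothetical encoder: map a value to the nearest bin in 0..127.
 * Round-to-nearest is an assumption of this sketch. */
unsigned char wibEncodeValue(double value, double lowerLimit, double dataRange)
{
    double scaled;

    if (dataRange <= 0.0)
        return 0;                       /* every value in the block is equal */
    scaled = (value - lowerLimit) / dataRange * 127.0;
    if (scaled < 0.0)
        scaled = 0.0;                   /* clamp values below the block's range */
    if (scaled > 127.0)
        scaled = 127.0;                 /* clamp values above the block's range */
    return (unsigned char)(scaled + 0.5);
}

/* Decoder, using the formula from the format description. */
double wibDecodeValue(unsigned char b, double lowerLimit, double dataRange)
{
    return lowerLimit + dataRange * ((double)b / 127.0);
}
```

With lowerLimit = -1.0 and dataRange = 3.0 the resolution is 3/127, about 0.0236. Under these assumptions an input of 0.0 lands in bin 42 and decodes to about -0.0079, and a small positive input such as 0.001 lands in the same bin and therefore also decodes to a negative value.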
Sparse, irregular data can cause an excessive number of "NO DATA" bytes to be inserted into the .wib file. The data bytes in the .wib files are used most efficiently when all data values align to "span" offset positions.
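The cost of sparse data can be estimated with a little arithmetic. The sketch below assumes that a single block stores one byte per "span" step between its first and last value, with "NO DATA" bytes filling the gaps; an encoder that starts a new block at large gaps would behave differently. The function names are illustrative only.

```c
/* Bytes needed for one block whose first and last values sit at firstPos and
 * lastPos, assuming one byte per "span" step between them. */
unsigned int wibBlockBytes(unsigned int firstPos, unsigned int lastPos, unsigned int span)
{
    if (span == 0 || lastPos < firstPos)
        return 0;
    return (lastPos - firstPos) / span + 1;
}

/* Number of those bytes that would be "NO DATA" fillers when the block holds
 * nValues real data values. */
unsigned int wibFillerBytes(unsigned int firstPos, unsigned int lastPos,
                            unsigned int span, unsigned int nValues)
{
    unsigned int total = wibBlockBytes(firstPos, lastPos, span);

    return (total > nValues) ? total - nValues : 0;
}
```

Three values at positions 100, 105 and 1000 with a span of 1 would need 901 bytes under this assumption, 898 of them fillers. Three values at positions 100, 105 and 110 with a span of 5 would need only 3 bytes and no fillers.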
The bed format input is merely for compatibility convenience with existing bed format files. It is not an efficient use of the data encoding method. Each bed element specified causes data values to be defined for each base in the range of positions. Perhaps future versions of this format can address this situation with a run-length encoding of the data values.