Blat All Genomes: Difference between revisions
No edit summary |
No edit summary |
||
Line 130: | Line 130: | ||
Normalizing between dna and dnax? | Normalizing between dna and dnax? | ||
We could normalize them to DNA span. | We could normalize them to DNA span. | ||
<pre> | |||
(334)*3*4 = 4008 | (334)*3*4 = 4008 | ||
802*5 = 4010 | 802*5 = 4010 | ||
</pre> |
Revision as of 19:36, 1 May 2019
Background:
When we use Blat All Genomes feature on hgBlat, it simultaneously searches in parallel with multiple threads the query and possibly the reverse-complemented query against all of the default assemblies which have gfServers.
One of the reasons that it is fast is that most of those gfServer get relatively little traffic. Human and mouse might be busy, but the rest tend to be idle much of the time. So that gives us quick responses from the gfServers. On top of that, because we avoid doing full alignments that fetch the actual sequence from the target genome (.2bit) and the make an expensive dynamic programming local alignment, instead we process the raw hits that we get back from Blat locally. So it is very fast, although ultimately only produces an approximation of the full alignment that the BLAT client can make.
COMBINE ACROSS MULTIPLE QUERY-STRANDS
If the result required reverse complement of the query and resubmit, now there is a best scoring alignment for each strand. Pick the best of these and hide the other result.
SINGLE BIGGEST EXON
At first, we tried just finding the biggest scoring exon, but that did not produce satisfactory results. For instance, a primitive organism which happens to have fewer introns would score higher with a single largest exon.
COUNT ALL HITS
Then we tried counting tile hits for all chromosomes no matter what. Ends up being massive overkill whenenver the user's query happens to contain some repetitive tiles.
GENE MODEL
Breaking it up by chromosome is a start in the right direction, but we needed to just go ahead and try to construct a simple gene model. Most real genes have coordinates which are increasing and non-overlapping for the exons in a gene, and we already in BLAT use a gap of more than 1MB to terminate a gene. So it was fairly easy to construct that algorithm, and it was working fairly well. The find-the-best gene-model code is ripping fast and runs in parallel too.
I discovered that the algorithm is so effective that small remaining inaccuracy in the gene model could only be improved by loading actual DNA so it could see mismatches while scoring one possible exon over another -- however we do not want to do that because of speed issues.
TRANSLATED SPACE
For handling protein and rnax and dnax, we have multiple frames in protein space. To deal with merging the results of the 3 frames, I divided the counts by 3 since otherwise you would get nearly all of the output triplicated if there were only perfect matching tiles. Also, it keeps track of the position on the chromosome where the gene model lies, especially the span from the start of the first exon to the end of the last exon. So it pick the lowest start value and the highest end value among frames to figure out the position.
WE HIDE THE APPROXIMATE POSITION
We actually have a surprisingly accurate position where the gene lies, but it can be truncated by having a missing bit if DNA at the start and end that the fuller dynamic alignment would have found. Rather than confuse users with these approximate positions which users might have mistakenly cut-and-pasted as final results, we hide the real position and only reveal the actual chromosome since we know that for sure.
WE ONLY SHOW THE BEST ALIGNMENT FOUND
Just like blat, we find multiple gene alignments, but for the Blat All Genomes output, we only show the best single alignment. It is a little bit like the Feeling Lucky button, in that limited sense. But we would not want anyone to think that the single best alignment we show is the only one that exists.
USERS SEE ALIGNMENT WITH SMALL TILE HIT COUNT BUT FULL BLAT ALIGNMENT SHOWS NO RESULTS
In a typical case, there were a couple of tiles that were enough to get BLAT to return some DNA hits, but when it when to actually make an alignment, it found gaps and mismatches so that the final score was < 20, and then hgBlat filters those results out, showing the user nothing.
EXPECTED MAXIMUM TILE COUNTS
You can figure out from the query size and type and tileSize and stepSize, what the maximum tile count could possibly be for your query. Here we are assuming that there are NO GAPS and NO MISMATCHES, and NO REPEAT MASKING, all of which lower the score.
--
DNA
For DNA, our gfServers use tileSize=11 and stepSize=5, so if L is our DNA length of the input query, the maximum tile count T we could expect to see is:
T <= (L - (tileSize-stepSize)) / stepSize or T <= (L - (11 - 5))/5 = (L-6)/5. If L = 4017 then T <= (4017-6)/5 == 802.
tileSize-stepSize = amount of the last tile that extends beyond the previous tile.
--
PROTEIN
For TRANSLATED queries like protein, we have the tileSize=stepSize=4. Each position is 1 of 22 amino acids.
T <= L / stepSize or T <= L / 4. If L = 1339 then T <= 1339/4 == 334.
--
RNAX, DNAX
DNAX is just RNAX repeated with a reverse-complemented query. So the results are the same for both.
You have to convert the query to protein though, which means dividing the query DNA Length L by 3.
T <= (L/3)/stepSize = (L/3)/4 = L/12 T <= 4017/12 = 334
---
Normalizing between dna and dnax? We could normalize them to DNA span.
(334)*3*4 = 4008 802*5 = 4010