Monitoring Tasks Notes
This page is intended to document procedures and notes for various QA monitoring tasks.
Cronjob: Results from checkGbibMd5.sh
Hiram generally requests the vXXX gBiB Store Push (see the push request example), usually soon after the Tuesday release of CGIs to the RR/world. If that push has not been done before this script runs, you may see an md5 mismatch. Wait until the end of the week (possibly until the end of the Friday following the release) to give Hiram time to do the push, and check the push request group for the latest store push. If the md5sums still do not match after the GBiB has been pushed to the store, let Hiram know.
The /cluster/home/qateam/bin/scripts/checkGbibMd5.sh script checks for matching 'modify' times between beta's hgTracks and RR's hgTracks:
ssh qateam@hgwbeta stat -c %y /usr/local/apache/cgi-bin/hgTracks
2017-03-20 10:31:02.000000000 -0700
ssh qateam@hgw1 stat -c %y /usr/local/apache/cgi-bin/hgTracks
2017-03-20 10:31:02.000000000 -0700
If those times match, the script compares the md5sum of the local gbibBeta.zip with that of the version on the store. These should also match; in the example below they do not:
md5sum /usr/local/apache/htdocs/gbib/gbibBeta.zip | awk '{print $1}'
14e5f65fd19ecf43af05a313f884d26c
curl -s https://genome-store.ucsc.edu/media/products/gbib.zip | md5sum | awk '{print $1}'
0a90536be4f4e2fdb3e2d865762db818
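The two-stage logic described above can be sketched as a small shell helper. This is only an illustration, not the contents of checkGbibMd5.sh: the function names values_match and gbib_check are hypothetical, and the assumption that the md5 comparison is skipped when the hgTracks times differ is inferred from the description above.

```shell
# values_match: succeed only if both strings are non-empty and identical.
values_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# gbib_check (hypothetical): mirror the two-stage check described above.
# Arguments: beta mtime, RR mtime, local md5sum, store md5sum.
gbib_check() {
    beta_time=$1
    rr_time=$2
    local_md5=$3
    store_md5=$4
    if ! values_match "$beta_time" "$rr_time"; then
        # Assumption: no md5 comparison is made until beta and RR agree.
        echo "skip: hgTracks differs between beta and RR"
    elif values_match "$local_md5" "$store_md5"; then
        echo "ok: md5sums match"
    else
        echo "mismatch: local=$local_md5 store=$store_md5"
    fi
}

# Example call, using the same commands shown above:
# gbib_check \
#   "$(ssh qateam@hgwbeta stat -c %y /usr/local/apache/cgi-bin/hgTracks)" \
#   "$(ssh qateam@hgw1 stat -c %y /usr/local/apache/cgi-bin/hgTracks)" \
#   "$(md5sum /usr/local/apache/htdocs/gbib/gbibBeta.zip | awk '{print $1}')" \
#   "$(curl -s https://genome-store.ucsc.edu/media/products/gbib.zip | md5sum | awk '{print $1}')"
```

A "mismatch" result early in the week is expected if the store push has not happened yet.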
Cronjob: Results from checkMetaAday.csh
This monitoring task compares metadata tables (such as those in hgcentral) across hgwdev, hgwbeta, and the RR, and emails a report. Recipients must check the report for differences and correct any they find. The output is also written to genecats: http://genecats.cse.ucsc.edu/qa/test-results/metadata/ where each database gets an entry.
Email Example
checkMetaAday.csh mm8
database = mm8
0 dbDb.mm8.hgcentralbetaOnly
0 dbDb.mm8.hgcentralOnly
1 dbDb.mm8.common
0 blatServers.mm8.hgcentralbetaOnly
0 blatServers.mm8.hgcentralOnly
2 blatServers.mm8.common
0 defaultDb.mm8.hgcentralbetaOnly
0 defaultDb.mm8.hgcentralOnly
0 defaultDb.mm8.common
0 genomeClade.mm8.hgcentralbetaOnly
0 genomeClade.mm8.hgcentralOnly
1 genomeClade.mm8.common
0 liftOverChain.mm8.hgcentralbetaOnly
1 liftOverChain.mm8.hgcentralOnly <---This should be a zero. There is "1" row difference. Need to fix.
52 liftOverChain.mm8.common
details in http://genecats.cse.ucsc.edu/qa/test-results/metadata/details
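When the email is long, the entries that need attention are those ending in "Only" with a non-zero count. A minimal sketch for picking them out, assuming the email body is saved to a file or piped in; the function name flag_diffs is hypothetical and is not part of checkMetaAday.csh:

```shell
# flag_diffs (hypothetical): read "count name" pairs from stdin and print
# every *Only entry whose count is non-zero, followed by that count.
# Handles both one pair per line and several pairs on one line.
flag_diffs() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+$/ && $(i + 1) ~ /Only$/ && $i > 0)
                print $(i + 1), $i
    }'
}

# Example: flag_diffs < saved_email.txt    (saved_email.txt is a placeholder)
```

No output means that every hgcentralbetaOnly and hgcentralOnly count was zero.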
Process Example
Problem: The row below is in the RR liftOverChain table but not in the beta table
ornAna1 mm8 /gbdb/ornAna1/liftOver/ornAna1ToMm8.over.chain.gz 0.1 0 0 Y 1 N
Solution: Add row to beta liftOverChain table
First, double-check. On the RR:
hgsql -h genome-centdb -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentral
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+
| ornAna1 | mm8 | /gbdb/ornAna1/liftOver/ornAna1ToMm8.over.chain.gz | 0.1 | 0 | 0 | Y | 1 | N |
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+
On beta:
hgsql -h hgwbeta -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentralbeta
No results: no entries match. The row that exists in the RR table needs to be copied into the beta table.
First, make a file of the row which is on rr but not on beta.
hgsql -h genome-centdb -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentral > chain.dev
Next, load the file into beta:
hgsql -h hgwbeta -e "LOAD DATA LOCAL INFILE 'chain.dev' INTO TABLE liftOverChain" hgcentralbeta
Look in beta to see whether the contents of the file made it into the table:
hgsql -h hgwbeta -Ne "SELECT * FROM liftOverChain WHERE fromDb = 'ornAna1' AND toDb = 'mm8'" hgcentralbeta
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+
| ornAna1 | mm8 | /gbdb/ornAna1/liftOver/ornAna1ToMm8.over.chain.gz | 0.1 | 0 | 0 | Y | 1 | N |
+---------+-----+---------------------------------------------------+-----+---+---+---+---+---+
Looks good. Run checkMetaAday.csh again, for both ornAna1 and mm8:
checkMetaAday.csh mm8
database = mm8
0 dbDb.mm8.hgcentralbetaOnly
0 dbDb.mm8.hgcentralOnly
1 dbDb.mm8.common
0 blatServers.mm8.hgcentralbetaOnly
0 blatServers.mm8.hgcentralOnly
2 blatServers.mm8.common
0 defaultDb.mm8.hgcentralbetaOnly
0 defaultDb.mm8.hgcentralOnly
0 defaultDb.mm8.common
0 genomeClade.mm8.hgcentralbetaOnly
0 genomeClade.mm8.hgcentralOnly
1 genomeClade.mm8.common
0 liftOverChain.mm8.hgcentralbetaOnly
0 liftOverChain.mm8.hgcentralOnly <---Looks good!
53 liftOverChain.mm8.common <--up by 1, good!
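The copy steps in the process example above repeat the same query four times, so they can be sketched as one wrapper. This is a hedged illustration only: lift_query and copy_chain_row are hypothetical names, the row file name is made up, and the one-row sanity check is an added precaution rather than part of the documented procedure. The hgsql invocations are the ones shown above.

```shell
# lift_query: build the liftOverChain lookup used throughout the example.
lift_query() {
    printf "SELECT * FROM liftOverChain WHERE fromDb = '%s' AND toDb = '%s'" "$1" "$2"
}

# copy_chain_row (hypothetical): copy one liftOverChain row from the RR to beta.
# Arguments: fromDb, toDb. Run it from a scratch directory.
copy_chain_row() {
    from_db=$1
    to_db=$2
    row_file="chain.$from_db.$to_db"    # hypothetical file name
    # Dump the row from the RR.
    hgsql -h genome-centdb -Ne "$(lift_query "$from_db" "$to_db")" hgcentral > "$row_file" || return 1
    # Added precaution: refuse to load unless exactly one row was dumped.
    if [ $(wc -l < "$row_file") -ne 1 ]; then
        echo "expected exactly 1 row in $row_file" >&2
        return 1
    fi
    # Load it into beta, then show the row for visual confirmation.
    hgsql -h hgwbeta -e "LOAD DATA LOCAL INFILE '$row_file' INTO TABLE liftOverChain" hgcentralbeta &&
        hgsql -h hgwbeta -Ne "$(lift_query "$from_db" "$to_db")" hgcentralbeta
}

# Example: copy_chain_row ornAna1 mm8
```

Running checkMetaAday.csh afterwards, as shown above, remains the real confirmation that the tables agree.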
SLA Monitoring & Reporting
In Jan 2017, these Service Level Agreement data were migrated to a complicated Google Spreadsheet:
- SLA Outage Report Worksheet.
- Current manager of this spreadsheet: BrianL
- Please contact the current manager of the spreadsheet to enter/modify outages.
- The spreadsheet has a "README" tab to explain the spreadsheet format and best practices.
The old (no longer used) way to report outages was to add them to an HTML table, SLA.html.
Cronjob: Results from realTime.csh (previously known as gbLoaded)
Previously, this job checked table times for the table gbLoaded in the database of the day. The problem with that is that the genome-asia machine doesn't like the pushed gbLoaded table because there's something different about the timestamp field type on that machine. No other genbank table uses that field type, and nothing uses gbLoaded except for this job. Braney proposed a switch to using xenoRefGene instead of gbLoaded for this check.
- The job is tracked in the genecats repository (under qa/crontabs).
- Brian Lee updated this on hgwdev 3/30/17 and in the repository with commit 795b903 changing gbLoaded to xenoRefGene
- Note that there are a few assemblies that don't have this table (xenTro3, fr1, fr2, fr3, eboVir3, dm2). When the database of the day is on those assemblies, the check will provide no output.
- QA should check that the update times for all servers are close together, within a week. Set 1 (dev and beta) usually share the same time, Set 2 (rr/euro/asia) usually fall on the same day, and the two sets generally differ by less than a week.
Example output:
[qateam@hgwdev /cluster/home/qateam] /cluster/bin/scripts/realTime.csh `/cluster/bin/scripts/databaseAday.csh today` xenoRefGene verbose
xenoRefGene
=============
dev 2017-03-28 09:26:59
beta 2017-03-28 09:26:59
rr 2017-03-22 22:49:49
euro 2017-03-23 06:49:49
asia 2017-03-23 14:49:49
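The "within a week" check on this output can be done by eye, but a rough sketch of automating it follows. The function name spread_days is hypothetical, GNU date is assumed (for the -d option), and every timestamp is read as UTC, so any per-server time zone difference is ignored.

```shell
# spread_days (hypothetical): read "server YYYY-MM-DD HH:MM:SS" lines from
# stdin and print the whole-day gap between the oldest and newest time.
spread_days() {
    min=""
    max=""
    while read -r server day clock; do
        # Skip the table name and separator lines, which carry no date.
        case $day in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
            *) continue ;;
        esac
        # Assumes GNU date; timestamps are treated as UTC.
        secs=$(date -u -d "$day $clock" +%s) || continue
        if [ -z "$min" ] || [ "$secs" -lt "$min" ]; then min=$secs; fi
        if [ -z "$max" ] || [ "$secs" -gt "$max" ]; then max=$secs; fi
    done
    [ -n "$min" ] && echo $(( (max - min) / 86400 ))
}

# Example, using the command shown above:
# /cluster/bin/scripts/realTime.csh `/cluster/bin/scripts/databaseAday.csh today` xenoRefGene verbose | spread_days
```

A result of 7 or more would suggest that one set of servers has fallen behind.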
Cronjob: Results from checkTableStatus.csh "TABLE STATUS dump" emails
This cronjob checks the health of the TABLE STATUS dump that Mark set up for checking the health of GenBank tracks. The same dump is also used by another QA tool, updateTimes.csh. The only thing that matters in this cronjob's email is that the RR date looks recent:
- TABLE STATUS files were last dumped:
- hgwdev: 2016.10.24
- hgwbeta: 2016.10.24
- rr: 2019.01.20
We want the RR line (rr: 2019.01.20 above) to show a recent date. As of 2019, the dump files are here:
ls -lrt /hive/data/outside/genbank/var/tblstats/hgnfs1/ | tail
At one time they were at /cluster/data/genbank/var/tblstats/hgnfs1/ instead.
We want the dump to be up to date because the script /cluster/bin/scripts/updateTimes.csh calls /cluster/bin/scripts/getTableStatus.csh for dev and beta, but for the RR it calls /cluster/bin/scripts/getRRtableStatus.csh, which is described as:
# gets the status of any table from an RR database
# using mark's genbank dumps.
That script uses these dump files to find the latest update date on the RR. I believe this is to avoid abusive hgcentral queries against the RR.
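To turn "the RR date looks recent" into a number, the dump date from the email can be converted to an age in days. This is a minimal sketch: dump_age_days is a hypothetical name, GNU date is assumed, and how many days still count as recent is not stated on this page, so any threshold is the reader's choice.

```shell
# dump_age_days (hypothetical): print the number of days between a
# "YYYY.MM.DD" dump date and a reference day (default: today, in UTC).
dump_age_days() {
    dumped=$(printf '%s' "$1" | tr . -)     # 2019.01.20 -> 2019-01-20
    today=${2:-$(date -u +%Y-%m-%d)}
    # Assumes GNU date for the -d option.
    echo $(( ( $(date -u -d "$today" +%s) - $(date -u -d "$dumped" +%s) ) / 86400 ))
}

# Example: dump_age_days 2019.01.20
```

The optional second argument exists only so that the calculation can be checked against a fixed day.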
UCSC Entrez LinkOut
A rare non-cron task: making any changes required to our Entrez LinkOut files. Changes are only requested once every few years, but they require some fiddling with our XML files and decisions about how we want to present ourselves on the various Entrez websites (e.g. which types of records are most important for us to display the UCSC link on). An example is described here: https://groups.google.com/a/soe.ucsc.edu/d/msg/browser-qa/8F7RU4xiMvc/d7laysnXNG0J
Entrez LinkOut sends requests for changes (and statistics) to the browser-qa email address. See their normal statistics update emails: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/browser-qa/%22LinkOut$20team$20%22%7Csort:date
Check that blat servers are running OK
See the BlatServer Backup Page. There is a way to check the blat error logs (located in /scratch/$db/gfServer.log) when you get a report. For example, let's say you get this report:
Couldn't read string length
Error reading status information from blat1d:17779
error 255 on mm10 blat1d:17779
Summary:
problems: mm10 blat1d:17779
You can note that this is on blat1d for mm10 and then go to the blat1d machine:
ssh qateam@blat1d
Once you are connected (you might have to answer "yes" at the host key prompt, as you probably have not connected to this machine before), you can search the log for the specific day. In this case the error came in on 2019/03/07 and it was for the mm10 gfServer:
grep "2019/03/07" /scratch/mm10/gfServer.log | less
Then press "f" to page forward through the log until around the time of the incident. In this case the problem was reported around 4am, but the incident that brought down the server happened around midnight:
2019/03/07 00:06:31: info: gfServer version 36x2 on host blat1d, port 17779 connection from 132.76.220.198
2019/03/07 00:06:31: debug: 0ddf270562684f29query 7028
2019/03/07 07:02:32: info: gfServer version 36x2 on host blat1d, port 17779 (2019-03-07 07:02)
In this case, the most that can be gleaned is that the IP is 132.76.220.198. One would hope for a record of the query too, but that didn't make it into the log.
Jorge adds that a quick check is to search the log for "queries", which usually appears in a line that says "Server ready for queries!". This line is printed after a blat server restart. If you find that line, the last query before it was probably the one that killed the server.
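Jorge's quick check can be sketched with awk. This is an illustration under assumptions: last_before_restart is a hypothetical name, and it assumes that request lines in gfServer.log contain either "connection from" or "query", as in the excerpt above. Only the "Server ready for queries!" text is taken from the description above.

```shell
# last_before_restart (hypothetical): for each "Server ready for queries"
# line, print the last request line that was logged before it.
# Reads the named log files, or stdin if none are given.
last_before_restart() {
    awk '
        /Server ready for queries/ { if (last != "") print last; last = ""; next }
        /connection from|query/    { last = $0 }   # assumed request-line markers
    ' "$@"
}

# Example, on the blat machine:
# last_before_restart /scratch/mm10/gfServer.log
```

If the log recorded nothing between two restarts, nothing is printed for the second one.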