Checking RR status through hgTracksRandom
Latest revision as of 23:58, 4 June 2019
You can check the RR status by looking at the output of hgTracksRandom. Run the following command from hgwdev:
tail -100 /hive/users/qateam/perf/hgTracksRandom.log
(Results prior to July 10, 2013 are in /hive/users/qateam/perf/save.hgTracksRandom.log.)
Since 2006, trusty hgTracksRandom has been randomly testing output on the RR every 15 minutes and appending the results to the hgTracksRandom.log file, where normal output looks like:
September 11, 2012 16:50
hg19 chr1:26067320-26167320
hgwbeta.cse.ucsc.edu 1257
hgw0.cse.ucsc.edu 1335
hgw1.cse.ucsc.edu 1251
hgw2.cse.ucsc.edu 1220
hgw3.cse.ucsc.edu 1386
hgw4.cse.ucsc.edu 1519
hgw5.cse.ucsc.edu 1679
hgw6.cse.ucsc.edu 1765
hgw7.cse.ucsc.edu 48926 <---
hgw8.cse.ucsc.edu 1650
------------------------------
The numbers indicate how many milliseconds it took to load the specified position in hgTracks on a particular machine. Numbers are sometimes high (indicated by arrows), and that's fine. When one of the hgw 1-8 machines is missing, that is reason to investigate further by going to that machine online and testing functionality. For example, abnormal output would look like this:
September 11, 2012 08:05
hg19 chr1:100683630-100783630
hgwbeta.cse.ucsc.edu 1628
hgw0.cse.ucsc.edu 1276
hgw1.cse.ucsc.edu 1141
hgw2.cse.ucsc.edu 1143
hgw3.cse.ucsc.edu 1337
hgw4.cse.ucsc.edu 1584
hgw5.cse.ucsc.edu 1621
hgw6.cse.ucsc.edu 1747
hgw7.cse.ucsc.edu 3178
Notice that hgw8 is missing from the list, and so is the nice little "-----" divider line at the end. That's because the program didn't get a response from hgw8 and stopped, and then it ran again 15 minutes later. If hgw4 were down instead, there wouldn't be any output after the hgw3 line, something like:
September 11, 2012 08:20
hg19 chr1:100683630-100783630
hgwbeta.cse.ucsc.edu 1628
hgw0.cse.ucsc.edu 1276
hgw1.cse.ucsc.edu 1141
hgw2.cse.ucsc.edu 1143
hgw3.cse.ucsc.edu 1337
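Rather than eyeballing the log for a missing divider, you can scan for it. The following is only a sketch, not part of the kent tree: the function name incomplete_runs is made up here, and it assumes the log layout shown in the samples above (a run starts with a date such as "September 11, 2012 08:05", each machine sits on its own line followed by its time in milliseconds, and a finished run ends with a line of dashes).

```shell
# Hypothetical helper (not part of the kent source tree).
# Lists every run in an hgTracksRandom log that never reached its divider
# line, together with the last machine that answered in that run.
# Assumed layout: date line starts a run, "machine milliseconds" lines
# follow, and a line made only of dashes closes a finished run.
incomplete_runs() {   # usage: incomplete_runs LOGFILE
    awk '
        # Print the pending run if it was never closed by a divider.
        function report() {
            if (open) print stamp "\t" (last == "" ? "(none)" : last)
        }
        # Date line, e.g. "September 11, 2012 08:05": a new run begins.
        /^[A-Z][a-z]+ [0-9]+, [0-9]+ [0-9]+:[0-9]+/ {
            report()
            open = 1
            stamp = $1 " " $2 " " $3 " " $4
            last = ""
            next
        }
        # Divider line: the run finished normally.
        /^-+$/ { open = 0; next }
        # Machine line: host name followed by a time in milliseconds.
        open && NF >= 2 && $2 ~ /^[0-9]+$/ { last = $1 }
        # The final run may also be unfinished (or still in progress).
        END { report() }
    ' "$1"
}
```

On hgwdev you would call it as incomplete_runs /hive/users/qateam/perf/hgTracksRandom.log. Each output line is a run's date and time, a tab, and the last machine that responded; the machine checked right after that one is the one to look at. If the very last run in the log shows up, keep in mind that it may simply still be running.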
This job runs every 15 minutes on the qateam crontab on hgwdev. Here's the relevant snippet (which you can also see in the genecats source tree, if you have it, in genecats/qa/crontabs/hgwdev.crontab):
MAILTO=kuhn,rhead,pauline,katrina,brianlee,braney,luvina,gary,ann,steve,jcasper
# performance log
5,20,35,50 * * * * hgTracksRandom /hive/users/qateam/perf/machines >> /hive/users/qateam/perf/hgTracksRandom.log
The job runs the program hgTracksRandom on a file called /hive/users/qateam/perf/machines. You can try running the program yourself anytime; it's just a C program in the kent source tree. The output goes into a file called /hive/users/qateam/perf/hgTracksRandom.log on hgwdev.
When an error like the ones above happens, the message you get from cron will likely be a note that doesn't tell you much. A cron error for hgTracksRandom is your prompt to check the log file to see where the program may have gotten stuck; that tells you which machines on the RR to check to see if they are loading. If they are not working, alert cluster-admin, browser-qa, and browser-dev. See more at RR_Down:_Sending_Alert_Messages_about_Genome_Browser_Being_Offline (http://genomewiki.cse.ucsc.edu/genecats/index.php/RR_Down:_Sending_Alert_Messages_about_Genome_Browser_Being_Offline).
Every once in a while, you will get output from this cron job that doesn't indicate a real problem with the RR. For instance, after the power was out, hgwbeta couldn't find the SQL host hgwbeta, and this cron job started complaining, even though the RR was fine. Also, it may not be alarming if a machine is unreachable but it is not currently in the RR. You can always tell what is in the RR with the host command, e.g.:
[rhead@hgwbeta ~]$ host genome.ucsc.edu
So, this is a very imperfect warning system that there *may* be a problem with the RR. (This program is ostensibly for the purpose of monitoring response times, but it also functions as a warning that one or more machines are not responding. Btw, if you ever want to look at response time logs in a nice graphical way, the admins have pretty cacti graphs available at https://kkstore.cse.ucsc.edu/cacti/graph_view.php. Ask around if you don't know the password.)
The machines being tested
The file /hive/users/qateam/perf/machines defines which machines are being tested. We rotate the machines in the RR very rarely, and only then does this file need to be changed. The order of the machines in the file is the inverse of the order in which hgTracksRandom checks them:
$ cat /hive/users/qateam/perf/machines
genome-euro.ucsc.edu
genome-asia.ucsc.edu
hgw6.cse.ucsc.edu
hgw5.cse.ucsc.edu
hgw4.cse.ucsc.edu
hgw3.cse.ucsc.edu
hgw2.cse.ucsc.edu
hgw1.cse.ucsc.edu
hgw0.cse.ucsc.edu
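Because the file is read bottom to top, working out which machine comes after the last one that answered takes a moment of thought. Here is a small sketch that does it for you. The function name next_machine is made up for illustration, and it relies only on the ordering described above (check order is the reverse of file order), so treat its answer as a hint about where to look, not as proof.

```shell
# Hypothetical helper (not part of the kent source tree).
# Given the machines file and the last machine that responded in the log,
# print the machine that is checked next, i.e. the likely culprit.
# Assumption: hgTracksRandom checks machines in the inverse of file order.
next_machine() {   # usage: next_machine MACHINES_FILE LAST_RESPONDER
    awk -v last="$2" '
        # Remember every non-empty line in file order.
        NF { name[++n] = $1 }
        END {
            if (n == 0) exit 1
            # No responder at all: the first machine checked is the
            # bottom line of the file.
            if (last == "") { print name[n]; exit 0 }
            # Walk the file bottom to top, which is the check order.
            for (i = n; i >= 1; i--) {
                if (name[i] == last) {
                    # The line above in the file is checked next.
                    # Print nothing if "last" was the final machine.
                    if (i > 1) print name[i - 1]
                    exit 0
                }
            }
            # The given machine is not listed in the file.
            exit 1
        }
    ' "$1"
}
```

For example, next_machine /hive/users/qateam/perf/machines hgw3.cse.ucsc.edu prints hgw4.cse.ucsc.edu with the file shown above. Pass an empty string as the second argument if the log shows no responders at all for a run. The function exits non-zero if the machine you name is not in the file, which can happen right after the RR has been rotated.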
Checking the error logs
Check out the Apache error log page (http://genomewiki.ucsc.edu/genecats/index.php/Apache_error_log_output) to learn more about looking through the error logs to investigate what a user might be doing.