Apache error log output

[[Category:Browser QA]]
[[Category:Browser QA CGI]]

Here are some examples of errors we often see but that are of no concern:

This is seen right after a push of new CGIs and is caused by browsers having a stale version of the CGIs/CSS files. A bunch of these will show up right after the push, but they should subside after several minutes as browsers grab the new versions.

<pre>
Cannot find correct version of file 'HGStyle.css'; this is due to an installation error, referer: http://hgw0.gi.ucsc.edu/
File does not exist: /data/apache/htdocs/js/jquery.imgareaselect-v291.js, referer: http://genome.ucsc.edu/cgi-bin/hgTracks
</pre>

We also see these often but they tend to right themselves as the load on the browser waxes and wanes.

<pre>
Too many popAbortHandlers, referer: http://genome.ucsc.edu/cgi-bin/hgTables?position=chr4:150237197-150248873&hgsid=353372405&refGene=pack&hgFind.matches=NM_023119,
</pre>
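If you want to confirm these messages really are transient, one option (a quick sketch, not an established procedure) is to count them in the live log a few minutes apart and watch the counts level off:

<pre>
# Count occurrences of the benign post-push messages in the current log.
# Run this a few minutes apart; the counts should stop growing as caches clear.
grep -c "Cannot find correct version of file" /usr/local/apache/logs/error_log
grep -c "Too many popAbortHandlers" /usr/local/apache/logs/error_log
</pre>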

== How to look at logs ==

When you are on a machine, you can watch the live log by following this file:

<pre>
/usr/local/apache/logs/error_log
</pre>

Older logs have names like this:

<pre>
/usr/local/apache/logs/error_log.1
/usr/local/apache/logs/error_log.2
</pre>

Example tail command:

<pre>
tail -f /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog"
</pre>
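A useful variation (illustrative, assuming you are interested in hgTracks errors): tail -F (capital F) keeps following across log rotation, and a second grep narrows the stream to one CGI:

<pre>
# Follow the live log across rotations, drop the noisy CGI_TIME/trackLog
# lines, and keep only messages that mention hgTracks.
tail -F /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog" | grep "hgTracks"
</pre>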

== How to find historic logs ==

If you want to look at historical logs, they are archived on hive:

<pre>
/hive/data/inside/wwwstats/RR/2018/
/hive/data/inside/wwwstats/RR/2019/
</pre>

Let's say you wanted to look at specific events on Feb. 14, 2019 on the hgw1 machine; that date falls in the compressed log for the week starting Feb. 11:

<pre>
zgrep "Feb 14" /hive/data/inside/wwwstats/RR/2019/hgw1/error_log.20190211.gz | less
</pre>
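To check the same date on every machine at once, you can loop over the per-machine directories (a sketch, assuming the hgw* layout described in the next section):

<pre>
# Search the same date across every RR machine for 2019.
for m in /hive/data/inside/wwwstats/RR/2019/hgw*; do
    echo "== $m =="
    zgrep "Feb 14" "$m"/error_log.20190211.gz | head -5
done
</pre>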

Ask Matt for more ways to use these logs, such as building a temporary directory of symlinks to the logs for specific date ranges to do more complicated queries.
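That temporary-directory trick might look roughly like this (a sketch of the idea only; Matt's actual approach may differ, and the file naming follows the scheme described below):

<pre>
# Collect symlinks to all Jan-Feb 2019 error logs from every machine
# into one scratch directory, then query the whole range at once.
mkdir -p /tmp/logRange
for f in /hive/data/inside/wwwstats/RR/2019/hgw*/error_log.2019{01,02}*.gz; do
    # Prefix each link with the machine name so files don't collide.
    ln -s "$f" /tmp/logRange/$(basename $(dirname "$f")).$(basename "$f")
done
zgrep "some pattern" /tmp/logRange/*.gz | less
</pre>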


== Example of accessing /hive/data/inside/wwwstats/RR/ for an IP ==

Error log location for hgw2:

<pre>
/hive/data/inside/wwwstats/RR/2019/hgw2
</pre>

The last two directory components are $year and $machine.

Within that directory:

<pre>
-rw-rw-r--  1 hiram genecats 498M Mar 28 10:40 access_log
-rw-rw-r--  1 hiram genecats 891M Mar 28 10:40 error_log
-rw-r--r--  1 hiram protein   54M Jan 14 03:10 access_log.20190106.gz
-rw-rw-r--  1 hiram genecats  48M Jan 20 03:30 access_log.20190114.gz
-rw-rw-r--  1 hiram genecats  58M Jan 28 03:41 access_log.20190120.gz
...
</pre>

There is an access_log, which is the latest log (covering the last week), and likewise an error_log. Once a log is more than a week old, it gets compressed, and the file name gets the year, month, and day of the start of the week it covers, e.g. access_log.20190120.gz.
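This means the compressed file covering a given date is the newest one whose date stamp is on or before that date. For example (a sketch based on the naming scheme above):

<pre>
# The newest archived access log whose YYYYMMDD stamp is on or before
# Feb 14, 2019 is the one that covers that date.
ls /hive/data/inside/wwwstats/RR/2019/hgw2/access_log.*.gz | awk -F. '$2 <= 20190214' | tail -1
</pre>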


If you wanted to research what a recent offending IP was doing, you would check the different machines; in this case we saw there was some activity on hgw2, so from that directory run the following:

<pre>
$ cat access_log | awk '{print $1}' | sort | uniq -c | sort -rn | head
</pre>

<pre>
  33386 54.82.172.170
  28818 130.91.194.123
  16993 130.223.121.12 (Offender) (16993 hits on hgw2)
</pre>
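An offender is often active on more than one machine; a quick way to check them all is to run the same pipeline per directory (illustrative, using the layout shown above):

<pre>
# Top three client IPs in the current access log on each RR machine.
for m in /hive/data/inside/wwwstats/RR/2019/hgw*; do
    echo "== $m =="
    awk '{print $1}' "$m"/access_log | sort | uniq -c | sort -rn | head -3
done
</pre>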

Ex. Summarizing the actions of the 130.223.121.12 IP:

<pre>
$ cat access_log | grep "130.223.121.12" | awk '{print $7}' | cut -f 3 -d '/' | cut -f 1 -d "?" | sort | uniq -c | sort -rn | head
</pre>

<pre>
  16905 hgTables
     18 hgt
     14 hgTracks
</pre>

So this 130.223.121.12 IP is hammering hgTables.
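For reference, here is what each stage of that pipeline is pulling out of the standard Apache access log line (annotation added here, not part of the original page):

<pre>
# 130.223.121.12 - - [25/Mar/2019:08:12:27 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" ...
#      $1 = client IP         $4 = timestamp          $7 = request path
# cut -f 3 -d '/' keeps the third /-separated piece of "/cgi-bin/hgTables?",
# and cut -f 1 -d "?" drops the query string, leaving just the CGI name.
</pre>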

Ex. Checking the raw requests from the 130.223.121.12 IP:

<pre>
$ cat access_log | grep "130.223.121.12" | head -n 3
</pre>

<pre>
130.223.121.12 - - [25/Mar/2019:08:12:27 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 1222212 "-" "Python-urllib/2.7" 1257605 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:30 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 167045 "-" "Python-urllib/2.7" 2063636 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:31 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 681933 "-" "Python-urllib/2.7" 2170569 microseconds
</pre>

The Python-urllib/2.7 user agent shows they are using Python to query programmatically, and POST passes parameters without putting them in the URL, so we cannot see what they are searching for. We do see that they are issuing multiple queries a minute without waiting for a response, which is all the more evident as they currently have an almost 60-minute backlog.
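To back up the multiple-queries-a-minute claim with numbers, the timestamp field can be bucketed by minute (an illustration against the same access_log):

<pre>
# Requests per minute from this IP: $4 is the [25/Mar/2019:08:12:27 timestamp,
# and characters 2-18 keep everything through the minute.
grep "130.223.121.12" access_log | awk '{print $4}' | cut -c 2-18 | sort | uniq -c | sort -rn | head
</pre>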

== Ways to track Genome Browser usage stats ==

=== Using cooccurrenceCounts.c and filterTop.py ===

This method was developed by Chris Lee and Max Haussler. It combines cooccurrenceCounts.c, which builds a usage matrix out of the error logs, with filterTop.py, which filters the matrix down to a certain number of tracks to make the data more digestible.

==== cooccurrenceCounts ====

This program can be found in the following directory:

<pre>
kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts
</pre>

You may have to do a 'make' in the directory to compile it. If you get an error, it may mean you are missing an x86_64/ directory in your bin; Hiram was able to quickly diagnose this. The program can be run with no arguments to show a usage statement that includes examples of how to run it:

<pre>
$ cooccurrenceCounts
coocurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
   cooccurrenceCounts -db=dbName inFile outFile
options:
   -db=XXX    = ucsc db name to filter on
   -tc=XXX    = min (ie >=) filter for track counts
   -hc=XXX    = min (ie >=) filter for hgsid counts
   -hgsids    = don't make track matrix, instead print a tab separated
                file of hgsids:hgsidUseCount and track usages, for example:
                hgsid:hgsidUseCount trackCount1 trackCount2 ...
   -verbose=X = logging level, log messages print to stderr
...
</pre>

In order to run the program, you must first put together a list of the error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory on hive:

<pre>
$ mkdir /hive/users/lrnassar/ErrorLogs
</pre>

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:

<pre>
$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error
</pre>

This will create symlinks to all the error logs starting with the 7th of January 2018 (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The final 'error' argument states that we want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).
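A quick sanity check that the symlinks landed (illustrative):

<pre>
# Expect roughly 200 symlinks: one per weekly error log per machine.
ls /hive/users/lrnassar/ErrorLogs | wc -l
</pre>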

Next you will want to run cooccurrenceCounts. It is recommended that you make a new directory for the output; in this case I chose /hive/users/lrnassar/MatrixData.

<pre>
$ cd /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ out.txt -verbose=2
</pre>

The -hc flag stipulates how many times an hgsid needs to be present for its data not to be filtered out. In this case I chose a very high number to try and filter out bots, as I was using an entire year (2018) of data; the cutoffs in the example usage statements are not as high. The -tc flag filters out tracks that do not occur very often; likewise the cutoff chosen here is higher than in the examples. For reference, the above command took about 15 minutes to run.
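Before moving on to filterTop.py, you can peek at the matrix to confirm it looks sane. This sketch assumes out.txt is tab-separated, as the usage statement's -hgsids description suggests; the exact format may differ:

<pre>
# Peek at the first few rows and columns of the track matrix.
head -3 out.txt | cut -f 1-5
</pre>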