Apache error log output: Difference between revisions

From Genecats
Jump to navigationJump to search
Line 89: Line 89:
[[Category:Browser QA CGI]]
[[Category:Browser QA CGI]]


== Ways to track Genome Browser usage stats: ==
== Ways to track Genome Browser usage stats ==
 
It is worth noting that the first method described (coocuurenceCounts.c + filterTop.py) make a matrix of track usage, and can therefore provide relational information between the tracks. For general statistics (e.g. total track usage, total db used, usage by month) the second method (generateUsageStats.py) is recommended as it is more streamlined.


===Using cooccurrenceCounts.c and filterTop.py===
===Using cooccurrenceCounts.c and filterTop.py===


This method was deduced by Chris Lee and Max Haussler. It uses a combination of coocurrenceCounts.c which created a usage matrix out of the error logs, and then filterTop.py which filters out a certain number of tracks from the matrix to make the data more digestible. Most of this explains how to find which tracks are used commonly with each other, however at the end is an example of how to get general usage stats.
This method was deduced by Chris Lee and Max Haussler. It uses a combination of coocurrenceCounts.c which creates a usage matrix out of the error logs, and then filterTop.py which filters out a certain number of tracks from the matrix to make the data more digestible. Keep in mind every instance of an hgsid that meets the threshold is counted upon refresh. This means that a user browsing heavily will be represented more heavily (as opposed to generateUsageStats.py). Most of this explains how to find which tracks are used commonly with each other, however at the end is an example of how to get general usage stats.


====cooccurrenceCounts====
====cooccurrenceCounts====
Line 239: Line 241:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt
</pre>
</pre>
===Using generateUsageStats.py===
This method was put together my Matthew Speir. It generates usage statistics for dbs, tracks, and hubs tracks using Apache error_log files. This script only counts an hgsid use a single time (as opposed to the previous method), that is to say heavy browser users are only represented a single time if their hgsid remains the same.
This program can be found in the following directory:
<pre>
kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py</pre>
The program can be ran with no arguments to show a usage statement that includes examples of how to run it:
<pre>
$ ./generateUsageStats.py
usage: generateUsageStats.py [-h] [-f FILENAME] [-d DIRNAME] [-p] [-m] [-j]
                            [-t] [-o OUTDIR]
Generates usage statistics for dbs, tracks, and hubs tracks using Apache error_log files
optional arguments:
  -h, --help            show this help message and exit
  -f FILENAME, --fileName FILENAME
                        input file name, must be space-separated Apache
                        error_log file
  -d DIRNAME, --dirName DIRNAME
                        input directory name, files must be space-separated
                        error_log files. No other files should be present in
                        this directory.
  -p, --perMonth        output file containing info on db/track/hub track use
                        per month
  -m, --monthYear      output file containing month/year pairs (e.g. "Mar
                        2017")
  -j, --jsonOut        output json files for summary dictionaries
  -t, --outputDefaults  output file containing info on default track usage for
                        top 15 most used assemblies
  -o OUTDIR, --outDir OUTDIR
                        directory in which to place output files
</pre>
In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:
<pre>
$ mkdir /hive/users/lrnassar/ErrorLogs</pre>
Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be ran with no arguments for a usage statement). I recommend running it from Within your new directory:
<pre>
$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error
</pre>
This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR), and going forward from there 12 months. Also I state I want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).
Next you can call the python script to generate the usage stats. I ran the program in its own directory using the following command:
<pre>
$ pwd
/hive/users/lrnassar/UsageStatsPython
~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py -d ../ErrorLogs/ -o . -pm
</pre>
The -d flag directs the script to the directory with the error logs, -o designates the output directory, -p tells it to include 'perMonth' numbers (as opposed to just year), and -m tells it to include month/year pairs in the output.

Revision as of 18:57, 16 April 2019

Here are some examples of errors we often see but that are of no concern:

This is seen right after a push of new CGIs and is caused by browsers having a stale version of the CGIs/css file. A bunch of these will show up right after the push, but they should subside after several minutes as browser grab new versions.

 Cannot find correct version of file 'HGStyle.css'; this is due to an installation error, referer: http://hgw0.gi.ucsc.edu/
 File does not exist: /data/apache/htdocs/js/jquery.imgareaselect-v291.js, referer: http://genome.ucsc.edu/cgi-bin/hgTracks

We also see these often but they tend to right themselves as the load on the browser waxes and wanes.

 Too many popAbortHandlers, referer: http://genome.ucsc.edu/cgi-bin/hgTables?position=chr4:150237197-150248873&hgsid=353372405&refGene=pack&hgFind.matches=NM_023119,

How to look at logs

When you are on a machine you can look at the logs by following this file:

/usr/local/apache/logs/error_log

Older logs with names like this:

/usr/local/apache/logs/error_log.1
/usr/local/apache/logs/error_log.2

Example tail command:

tail -f /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog"

How to find historic logs

If you wanted to look at historical stuff there is a place here on hive:

/hive/data/inside/wwwstats/RR/2018/
/hive/data/inside/wwwstats/RR/2019/

Let's say you wanted to look at specific events on Feb. 14th on the hgw1 machine for 2019:

zgrep "Feb 14" /hive/data/inside/wwwstats/RR/2019/hgw1/error_log.20190211.gz | less

Ask Matt for more ways to use these logs, such as building a temporary directory of simlinks to the logs for specific date ranges to do more complicated querries.


Example of accessing /hive/data/inside/wwwstats/RR/ for IP

Error log location for hgw2:

/hive/data/inside/wwwstats/RR/2019/hgw2

The last two directories are for $year and $machine

Within that directory:

-rw-rw-r--  1 hiram genecats 498M Mar 28 10:40 access_log
-rw-rw-r--  1 hiram genecats 891M Mar 28 10:40 error_log
-rw-r--r--  1 hiram protein   54M Jan 14 03:10 access_log.20190106.gz
-rw-rw-r--  1 hiram genecats  48M Jan 20 03:30 access_log.20190114.gz
-rw-rw-r--  1 hiram genecats  58M Jan 28 03:41 access_log.20190120.gz
...

There is an access_log, which is the latest log (within the last week). Same for the error_log. Once a log is >week old, it gets compressed and the file name is 2019-month-day(start of week). Ex. access_log.20190120.gz.


If you wanted to research what a recent offending IP was doing, you would check the different machines, in this case we saw there is some activity on hgw2, and do the following:

$cat access_log | awk '{print $1}' | sort | uniq -c | sort -r | head

  33386 54.82.172.170
  28818 130.91.194.123
  16993 130.223.121.12 (Offender) (16993 hits on hgw2)

Ex. Summarizing specific 130.223.121.12 IP actions

$cat access_log | grep "130.223.121.12" | awk '{print $7}' | cut -f 3 -d '/' | cut -f 1 -d "?" | sort | uniq -c | sort -r | head

  16905 hgTables
     18 hgt
     14 hgTracks

So this 130.223.121.12 IP is hammering hgTables.

Ex. Checking specific 130.223.121.12 IP requests

$cat access_log | grep "130.223.121.12" | head -n 3

130.223.121.12 - - [25/Mar/2019:08:12:27 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 1222212 "-" "Python-urllib/2.7" 1257605 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:30 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 167045 "-" "Python-urllib/2.7" 2063636 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:31 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 681933 "-" "Python-urllib/2.7" 2170569 microseconds

The results show Python-urllib/2.7 which shows they are using python to programmatically query, and the POST is a different way to input parameters without using the URL, so we don't know what they're searching for. We do see that they are doing multiple queries a minute without waiting for a response, which is more evident as they currently have an almost 60 minute bottleneck.

Ways to track Genome Browser usage stats

It is worth noting that the first method described (coocuurenceCounts.c + filterTop.py) make a matrix of track usage, and can therefore provide relational information between the tracks. For general statistics (e.g. total track usage, total db used, usage by month) the second method (generateUsageStats.py) is recommended as it is more streamlined.

Using cooccurrenceCounts.c and filterTop.py

This method was deduced by Chris Lee and Max Haussler. It uses a combination of coocurrenceCounts.c which creates a usage matrix out of the error logs, and then filterTop.py which filters out a certain number of tracks from the matrix to make the data more digestible. Keep in mind every instance of an hgsid that meets the threshold is counted upon refresh. This means that a user browsing heavily will be represented more heavily (as opposed to generateUsageStats.py). Most of this explains how to find which tracks are used commonly with each other, however at the end is an example of how to get general usage stats.

cooccurrenceCounts

This program can be found in the following directory:

kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts

You may have to do a 'make' in the directory to compile it. If you get an error, it may mean you are missing a x86_64/ directory in your bin. Hiram was able to quickly diagnose this. The program can be ran with no arguments to show a usage statement that includes examples of how to run it:

$ cooccurrenceCounts 
coocurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
   cooccurrenceCounts -db=dbName inFile outFile
options:
   -db=XXX    = ucsc db name to filter on
   -tc=XXX    = min (ie >=) filter for track counts
   -hc=XXX    = min (ie >=) filter for hgsid counts
   -hgsids    = don't make track matrix, instead print a tab separated 
                file of hgsids:hgsidUseCount and track usages, for example: 
                hgsid:hgsidUseCount trackCount1 trackCount2 ... 
   -verbose=X = logging level, log messages print to stderr
...

In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be ran with no arguments for a usage statement). I recommend running it from Within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR), and going forward from there 12 months. Also I state I want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).

Next you will want to run cooccurrenceCounts. It is recommended you make a new directory for the output, in this case I chose to do it in (/hive/users/lrnassar/MatrixData).

$ /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ 2018UsageStats.txt -verbose=2

The -hc flag stipulates how many times an hgsid needs to be present for the data not to be filtered out. In this case I chose a very high number to try and filter out bots, as I was using the entire 2018's worth of data. Worth noting the example usage statements are not as high. The -tc filters out tracks that do not occur very often. Likewise, the chosen cutoff is higher than the examples. For reference, the above command took about 15 minutes to run.

At this point, you should have a list of tracks you want to get stats on (for an example on just general usage, see the bottom of this section). In my example, we had created a session based around 'clinical' tracks, and we wanted to see if users often turned on other tracks in combination with them. My track list is as follows (found in /hive/users/lrnassar/MatrixData/clinicalThemeTrackNames.txt):

  • knownGene
  • ncbiRefSeqCurated
  • pubsBlat
  • pubsMarkerSnp
  • iscaPathGainCum
  • iscaPathLossCum
  • iscaPathogenic
  • clinvarMain
  • clinvarCnv
  • omimAvSnp
  • spMut
  • snp150Common
  • snp150Flagged
  • tgpPhase3
  • evsEsp6500
  • exacVariants
  • decipher
  • decipherSnvs
  • hgmd
  • lovd

Note: You have to input the track name, not the long/short label

You can now create a list of all the tracks in your usage stats to make sure that your tracks made the cutoff. For example, if you used a very high -tc cutoff for coocurrentcounts, many tracks could have been filtered out. You will get an error if you try to run the python filter script on a track that doesn't exist in your matrix.

wc -l clinicalThemeTrackNames.txt
20
head -1 2018UsageStats.txt > 2018TrackList.txt
grep -owf clinicalThemeTrackNames.txt 2018TrackList.txt | wc -l
20

Here we can see that 20/20 tracks are in the matrix. The next step is optional, but in my case I wanted to remove all default tracks from my track list. Default tracks tend to be associated with tons of other tracks (specially other default tracks) as they are on... by default. You can create your own list; the list we put together for hg19 is here (/hive/users/lrnassar/MatrixData/hg19Defaults.txt).

grep -vwf hg19Defaults.txt clinicalThemeTrackNames.txt > clinicalThemeTrackNamesNoDefaults.txt
wc -l clinicalThemeTrackNamesNoDefaults.txt
16 clinicalThemeTrackNamesNoDefaults.txt

filterTop.py

Here we see that 4 default tracks were filtered out of the original 20. Next we want to create a submatrix for each of our chosen tracks. This is done with filterTop.py. A usage statement that includes for the script can be seen with the --help flag:

$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py --help

For our purposes, we will want to run filterTop.py on each of our tracks. We can loop the script as follows:

for track in $(cat temp.txt); do
        # don't include the default tracks
        $HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py \
        -n 50 -t $track --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt \
        /hive/users/lrnassar/MatrixData/2018UsageStatsNoPrefix.txt $track.top50.nodefaults
done

Keep in mind the --default-track-list is an optional flag. If you use it, you follow the flag with a file of all default tracks (or any tracks you want excluded from output). In my case, this outputs 16 submatrixes of each of the 50 top tracks used in conjunction with the original 16 input tracks. These matrices can be freely explored, but for my purposes, I wanted a more digestible output. So instead I pulled out the header line which represents the 50 tracks most often seen with each of the input tracks, then changed all the tabs into newlines, and appended them all to a single file.

for each in $(ls *.nodefaults); do
        head -n 1 $each | sed 's/\t/\n/g' >> usageResults.txt
done

Now we have a file (usageResults.txt) which had a list of all the tracks in the top 50 for all of our input tracks. Next, we sort and count the unique entries:

cat usageResults.txt | sort | uniq -c | sort -r | head -n 30 > usageResults.final.txt
cat usageResults.final.txt | head -n 10 
     16 dgvPlus
     16 dgvMerged
     15 omimLocation
     15 decipher
     15 cnvDevDelay
     15 clinvarMain
     15 clinvarCnv
     15 clinvar
     14 omimAvSnp
     13 genomicSuperDup

Here we have the total results. The top number being 16 (the original number of input tracks). In this case, dgvMerged and dgvPlus are both in the top 50 usage for all 16 input tracks. This means they are very often used together with those tracks, etc. This was useful for me to see what tracks users like to see together, but has other applications as well.

General Stats Example

If you just want to generate a usage matrix of say, the top 100 tracks users use overall, you can use the same python script but not designate a specific track to filter by.

#See the top 100 tracks that are used from a single error log:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 2018UsageStatsNoPrefix.txt top100Tracks2018.txt
#or with no defaults
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt

Using generateUsageStats.py

This method was put together my Matthew Speir. It generates usage statistics for dbs, tracks, and hubs tracks using Apache error_log files. This script only counts an hgsid use a single time (as opposed to the previous method), that is to say heavy browser users are only represented a single time if their hgsid remains the same.

This program can be found in the following directory:

kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py

The program can be ran with no arguments to show a usage statement that includes examples of how to run it:

$ ./generateUsageStats.py 
usage: generateUsageStats.py [-h] [-f FILENAME] [-d DIRNAME] [-p] [-m] [-j]
                             [-t] [-o OUTDIR]

Generates usage statistics for dbs, tracks, and hubs tracks using Apache error_log files

optional arguments:
  -h, --help            show this help message and exit
  -f FILENAME, --fileName FILENAME
                        input file name, must be space-separated Apache
                        error_log file
  -d DIRNAME, --dirName DIRNAME
                        input directory name, files must be space-separated
                        error_log files. No other files should be present in
                        this directory.
  -p, --perMonth        output file containing info on db/track/hub track use
                        per month
  -m, --monthYear       output file containing month/year pairs (e.g. "Mar
                        2017")
  -j, --jsonOut         output json files for summary dictionaries
  -t, --outputDefaults  output file containing info on default track usage for
                        top 15 most used assemblies
  -o OUTDIR, --outDir OUTDIR
                        directory in which to place output files

In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be ran with no arguments for a usage statement). I recommend running it from Within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR), and going forward from there 12 months. Also I state I want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).

Next you can call the python script to generate the usage stats. I ran the program in its own directory using the following command:

$ pwd
/hive/users/lrnassar/UsageStatsPython
~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py -d ../ErrorLogs/ -o . -pm

The -d flag directs the script to the directory with the error logs, -o designates the output directory, -p tells it to include 'perMonth' numbers (as opposed to just year), and -m tells it to include month/year pairs in the output.