Apache error log output

From Genecats
Jump to navigationJump to search

Here are some examples of errors we often see but that are of no concern:

This is seen right after a push of new CGIs and is caused by browsers having a stale version of the CGIs/css file. A bunch of these will show up right after the push, but they should subside after several minutes as browser grab new versions.

 Cannot find correct version of file 'HGStyle.css'; this is due to an installation error, referer: http://hgw0.gi.ucsc.edu/
 File does not exist: /data/apache/htdocs/js/jquery.imgareaselect-v291.js, referer: http://genome.ucsc.edu/cgi-bin/hgTracks

We also see these often but they tend to right themselves as the load on the browser waxes and wanes.

 Too many popAbortHandlers, referer: http://genome.ucsc.edu/cgi-bin/hgTables?position=chr4:150237197-150248873&hgsid=353372405&refGene=pack&hgFind.matches=NM_023119,

How to look at logs

When you are on a machine you can look at the logs by following this file:

/usr/local/apache/logs/error_log

Older logs with names like this:

/usr/local/apache/logs/error_log.1
/usr/local/apache/logs/error_log.2

Example tail command:

tail -f /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog"

How to find historic logs

If you wanted to look at historical stuff there is a place here on hive:

/hive/data/inside/wwwstats/RR/2018/
/hive/data/inside/wwwstats/RR/2019/

Let's say you wanted to look at specific events on Feb. 14th on the hgw1 machine for 2019:

zgrep "Feb 14" /hive/data/inside/wwwstats/RR/2019/hgw1/error_log.20190211.gz | less

Ask Matt for more ways to use these logs, such as building a temporary directory of simlinks to the logs for specific date ranges to do more complicated querries.


Example of accessing /hive/data/inside/wwwstats/RR/ for IP

Error log location for hgw2:

/hive/data/inside/wwwstats/RR/2019/hgw2

The last two directories are for $year and $machine

Within that directory:

-rw-rw-r--  1 hiram genecats 498M Mar 28 10:40 access_log
-rw-rw-r--  1 hiram genecats 891M Mar 28 10:40 error_log
-rw-r--r--  1 hiram protein   54M Jan 14 03:10 access_log.20190106.gz
-rw-rw-r--  1 hiram genecats  48M Jan 20 03:30 access_log.20190114.gz
-rw-rw-r--  1 hiram genecats  58M Jan 28 03:41 access_log.20190120.gz
...

There is an access_log, which is the latest log (within the last week). Same for the error_log. Once a log is >week old, it gets compressed and the file name is 2019-month-day(start of week). Ex. access_log.20190120.gz.


If you wanted to research what a recent offending IP was doing, you would check the different machines, in this case we saw there is some activity on hgw2, and do the following:

$cat access_log | awk '{print $1}' | sort | uniq -c | sort -r | head

  33386 54.82.172.170
  28818 130.91.194.123
  16993 130.223.121.12 (Offender) (16993 hits on hgw2)

Ex. Summarizing specific 130.223.121.12 IP actions

$cat access_log | grep "130.223.121.12" | awk '{print $7}' | cut -f 3 -d '/' | cut -f 1 -d "?" | sort | uniq -c | sort -r | head

  16905 hgTables
     18 hgt
     14 hgTracks

So this 130.223.121.12 IP is hammering hgTables.

Ex. Checking specific 130.223.121.12 IP requests

$cat access_log | grep "130.223.121.12" | head -n 3

130.223.121.12 - - [25/Mar/2019:08:12:27 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 1222212 "-" "Python-urllib/2.7" 1257605 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:30 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 167045 "-" "Python-urllib/2.7" 2063636 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:31 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 681933 "-" "Python-urllib/2.7" 2170569 microseconds

The results show Python-urllib/2.7 which shows they are using python to programmatically query, and the POST is a different way to input parameters without using the URL, so we don't know what they're searching for. We do see that they are doing multiple queries a minute without waiting for a response, which is more evident as they currently have an almost 60 minute bottleneck.

Ways to track Genome Browser usage stats:

Using cooccurrenceCounts.c and filterTop.py

This method was deduced by Chris Lee and Max Haussler. It uses a combination of coocurrenceCounts.c which created a usage matrix out of the error logs, and then filterTop.py which filters out a certain number of tracks from the matrix to make the data more digestible. Most of this explains how to find which tracks are used commonly with each other, however at the end is an example of how to get general usage stats.

cooccurrenceCounts

This program can be found in the following directory:

kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts

You may have to do a 'make' in the directory to compile it. If you get an error, it may mean you are missing a x86_64/ directory in your bin. Hiram was able to quickly diagnose this. The program can be ran with no arguments to show a usage statement that includes examples of how to run it:

$ cooccurrenceCounts 
coocurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
   cooccurrenceCounts -db=dbName inFile outFile
options:
   -db=XXX    = ucsc db name to filter on
   -tc=XXX    = min (ie >=) filter for track counts
   -hc=XXX    = min (ie >=) filter for hgsid counts
   -hgsids    = don't make track matrix, instead print a tab separated 
                file of hgsids:hgsidUseCount and track usages, for example: 
                hgsid:hgsidUseCount trackCount1 trackCount2 ... 
   -verbose=X = logging level, log messages print to stderr
...

In order to run the program, you must first put together a list of what error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory in hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be ran with no arguments for a usage statement). I recommend running it from Within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR), and going forward from there 12 months. Also I state I want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).

Next you will want to run cooccurrenceCounts. It is recommended you make a new directory for the output, in this case I chose to do it in (/hive/users/lrnassar/MatrixData).

$ /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ 2018UsageStats.txt -verbose=2

The -hc flag stipulates how many times an hgsid needs to be present for the data not to be filtered out. In this case I chose a very high number to try and filter out bots, as I was using the entire 2018's worth of data. Worth noting the example usage statements are not as high. The -tc filters out tracks that do not occur very often. Likewise, the chosen cutoff is higher than the examples. For reference, the above command took about 15 minutes to run.

At this point, you should have a list of tracks you want to get stats on (for an example on just general usage, see the bottom of this section). In my example, we had created a session based around 'clinical' tracks, and we wanted to see if users often turned on other tracks in combination with them. My track list is as follows (found in /hive/users/lrnassar/MatrixData/clinicalThemeTrackNames.txt):

  • knownGene
  • ncbiRefSeqCurated
  • pubsBlat
  • pubsMarkerSnp
  • iscaPathGainCum
  • iscaPathLossCum
  • iscaPathogenic
  • clinvarMain
  • clinvarCnv
  • omimAvSnp
  • spMut
  • snp150Common
  • snp150Flagged
  • tgpPhase3
  • evsEsp6500
  • exacVariants
  • decipher
  • decipherSnvs
  • hgmd
  • lovd
  • Note:* You have to input the track name, not the long/short label

You can now create a list of all the tracks in your usage stats to make sure that your tracks made the cutoff. For example, if you used a very high -tc cutoff for coocurrentcounts, many tracks could have been filtered out. You will get an error if you try to run the python filter script on a track that doesn't exist in your matrix.

wc -l clinicalThemeTrackNames.txt
20
head -1 2018UsageStats.txt > 2018TrackList.txt
grep -owf clinicalThemeTrackNames.txt 2018TrackList.txt | wc -l
20

Here we can see that 20/20 tracks are in the matrix. The next step is optional, but in my case I wanted to remove all default tracks from my track list. Default tracks tend to be associated with tons of other tracks (specially other default tracks) as they are on... by default. You can create your own list; the list we put together for hg19 is here (/hive/users/lrnassar/MatrixData/hg19Defaults.txt).

grep -vwf hg19Defaults.txt clinicalThemeTrackNames.txt > clinicalThemeTrackNamesNoDefaults.txt
wc -l clinicalThemeTrackNamesNoDefaults.txt
16 clinicalThemeTrackNamesNoDefaults.txt

filterTop.py

Here we see that 4 default tracks were filtered out of the original 20. Next we want to create a submatrix for each of our chosen tracks. This is done with filterTop.py. A usage statement that includes for the script can be seen with the --help flag:

$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py --help

For our purposes, we will want to run filterTop.py on each of our tracks. We can loop the script as follows:

for track in $(cat temp.txt); do
        # don't include the default tracks
        $HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py \
        -n 50 -t $track --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt \
        /hive/users/lrnassar/MatrixData/2018UsageStatsNoPrefix.txt $track.top50.nodefaults
done

Keep in mind the --default-track-list is an optional flag. If you use it, you follow the flag with a file of all default tracks (or any tracks you want excluded from output). In my case, this outputs 16 submatrixes of each of the 50 top tracks used in conjunction with the original 16 input tracks. These matrices can be freely explored, but for my purposes, I wanted a more digestible output. So instead I pulled out the header line which represents the 50 tracks most often seen with each of the input tracks, then changed all the tabs into newlines, and appended them all to a single file.

for each in $(ls *.nodefaults); do
        head -n 1 $each | sed 's/\t/\n/g' >> usageResults.txt
done

Now we have a file (usageResults.txt) which had a list of all the tracks in the top 50 for all of our input tracks. Next, we sort and count the unique entries:

cat usageResults.txt | sort | uniq -c | sort -r | head -n 30 > usageResults.final.txt
cat usageResults.final.txt | head -n 10 
     16 dgvPlus
     16 dgvMerged
     15 omimLocation
     15 decipher
     15 cnvDevDelay
     15 clinvarMain
     15 clinvarCnv
     15 clinvar
     14 omimAvSnp
     13 genomicSuperDup

Here we have the total results. The top number being 16 (the original number of input tracks). In this case, dgvMerged and dgvPlus are both in the top 50 usage for all 16 input tracks. This means they are very often used together with those tracks, etc. This was useful for me to see what tracks users like to see together, but has other applications as well.

General Stats Example

If you just want to generate a usage matrix of say, the top 100 tracks users use overall, you can use the same pythin script but not designate a specific track to filter by.

#See the top 100 tracks that are used from a single error log:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 2018UsageStatsNoPrefix.txt top100Tracks2018.txt
#or with no defaults
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt