Apache error log output


Here are some examples of errors we often see but that are of no concern:

This is seen right after a push of new CGIs and is caused by browsers having a stale version of the CGI/CSS files. A bunch of these will show up right after the push, but they should subside after several minutes as browsers grab new versions.

 Cannot find correct version of file 'HGStyle.css'; this is due to an installation error, referer: http://hgw0.gi.ucsc.edu/
 File does not exist: /data/apache/htdocs/js/jquery.imgareaselect-v291.js, referer: http://genome.ucsc.edu/cgi-bin/hgTracks

We also see these often, but they tend to right themselves as the load on the browser waxes and wanes.

 Too many popAbortHandlers, referer: http://genome.ucsc.edu/cgi-bin/hgTables?position=chr4:150237197-150248873&hgsid=353372405&refGene=pack&hgFind.matches=NM_023119,

How to look at logs

When you are on one of the machines, you can watch the logs by following this file:

/usr/local/apache/logs/error_log

Older logs have names like this:

/usr/local/apache/logs/error_log.1
/usr/local/apache/logs/error_log.2

Example tail command:

tail -f /usr/local/apache/logs/error_log | grep -v "CGI_TIME\|trackLog"
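If you are chasing a problem in a single CGI, you can narrow the stream to just its messages; a minimal sketch (hgTables is only an example, substitute any CGI name):

tail -f /usr/local/apache/logs/error_log | grep "hgTables"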

How to find historic logs

If you want to look at historical logs, they are kept here on hive:

/hive/data/inside/wwwstats/RR/2018/
/hive/data/inside/wwwstats/RR/2019/

Let's say you want to look at specific events from Feb. 14, 2019 on the hgw1 machine (the file is named for the start of its week, so error_log.20190211.gz covers Feb 14):

zgrep "Feb 14" /hive/data/inside/wwwstats/RR/2019/hgw1/error_log.20190211.gz | less

Ask Matt for more ways to use these logs, such as building a temporary directory of symlinks to the logs for specific date ranges to do more complicated queries; a minimal sketch of that approach follows.
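For example, a minimal sketch of that symlink approach, assuming the directory layout described above (the dates and machines are only illustrative):

# collect one week of error logs from two machines into a scratch directory
mkdir /tmp/logRange
for m in hgw1 hgw2; do
    ln -s /hive/data/inside/wwwstats/RR/2019/$m/error_log.20190211.gz /tmp/logRange/$m.error_log.20190211.gz
done
# query them all at once
zgrep -h "Feb 14" /tmp/logRange/*.gz | less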


Example of accessing /hive/data/inside/wwwstats/RR/ for an IP

Error log location for hgw2:

/hive/data/inside/wwwstats/RR/2019/hgw2

The last two directories are $year and $machine, i.e. /hive/data/inside/wwwstats/RR/$year/$machine.

Within that directory:

-rw-rw-r--  1 hiram genecats 498M Mar 28 10:40 access_log
-rw-rw-r--  1 hiram genecats 891M Mar 28 10:40 error_log
-rw-r--r--  1 hiram protein   54M Jan 14 03:10 access_log.20190106.gz
-rw-rw-r--  1 hiram genecats  48M Jan 20 03:30 access_log.20190114.gz
-rw-rw-r--  1 hiram genecats  58M Jan 28 03:41 access_log.20190120.gz
...

There is an access_log, which is the current log (covering the last week); the same goes for the error_log. Once a log is more than a week old, it gets compressed, and the file is named by the year, month, and day of the start of its week, e.g. access_log.20190120.gz.
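Since the weekly rollover dates are not perfectly regular (note 20190106 vs 20190114 above), a small sketch to find which compressed log covers a given date: take the last file whose date stamp falls on or before it.

target=20190214
ls error_log.*.gz | awk -F. -v t=$target '$(NF-1) <= t' | tail -1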


If you want to research what a recent offending IP has been doing, you would check the different machines (in this case we saw there is some activity on hgw2) and do the following:

$ cat access_log | awk '{print $1}' | sort | uniq -c | sort -r | head

  33386 54.82.172.170
  28818 130.91.194.123
  16993 130.223.121.12 (Offender) (16993 hits on hgw2)

Ex. Summarizing specific 130.223.121.12 IP actions

$ cat access_log | grep "130.223.121.12" | awk '{print $7}' | cut -f 3 -d '/' | cut -f 1 -d "?" | sort | uniq -c | sort -r | head

  16905 hgTables
     18 hgt
     14 hgTracks

So this 130.223.121.12 IP is hammering hgTables.

Ex. Checking specific 130.223.121.12 IP requests

$ cat access_log | grep "130.223.121.12" | head -n 3

130.223.121.12 - - [25/Mar/2019:08:12:27 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 1222212 "-" "Python-urllib/2.7" 1257605 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:30 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 167045 "-" "Python-urllib/2.7" 2063636 microseconds
130.223.121.12 - - [25/Mar/2019:08:12:31 -0700] "POST /cgi-bin/hgTables? HTTP/1.1" 200 681933 "-" "Python-urllib/2.7" 2170569 microseconds

The results show Python-urllib/2.7, which tells us they are using Python to query programmatically, and since POST passes parameters without putting them in the URL, we don't know what they're searching for. We do see that they are issuing multiple queries a minute without waiting for a response, which is consistent with the almost 60 minute bottleneck they have currently built up.
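To confirm the request rate, a small sketch that counts this IP's hits per minute (field 4 is the timestamp, e.g. [25/Mar/2019:08:12:27; cutting to character 18 keeps it to the minute):

cat access_log | grep "130.223.121.12" | awk '{print $4}' | cut -c 2-18 | sort | uniq -c | sort -rn | head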

Ways to track Genome Browser usage stats

It is worth noting that the first method described (cooccurrenceCounts.c + filterTop.py) builds a matrix of track usage and can therefore provide relational information between tracks. For general statistics (e.g. total track usage, total db usage, usage by month), the second method (generateUsageStats.py) is recommended as it is more streamlined.

Using cooccurrenceCounts.c and filterTop.py

This method was devised by Chris Lee and Max Haeussler. It uses a combination of cooccurrenceCounts.c, which creates a usage matrix out of the error logs, and filterTop.py, which filters a certain number of tracks out of the matrix to make the data more digestible. Keep in mind that every instance of an hgsid that meets the threshold is counted on each refresh, so a user browsing heavily will be represented more heavily (as opposed to generateUsageStats.py). Most of this section explains how to find which tracks are commonly used with each other; at the end is an example of how to get general usage stats.

cooccurrenceCounts

This program can be found in the following directory:

kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts

You may have to do a 'make' in the directory to compile it. If you get an error, it may mean you are missing an x86_64/ directory in your bin; Hiram was able to quickly diagnose this. The program can be run with no arguments to show a usage statement that includes examples of how to run it:

$ cooccurrenceCounts 
coocurrenceCounts - Experiment to build a coocurrence matrix out of lists.
usage:
   cooccurrenceCounts -db=dbName inFile outFile
options:
   -db=XXX    = ucsc db name to filter on
   -tc=XXX    = min (ie >=) filter for track counts
   -hc=XXX    = min (ie >=) filter for hgsid counts
   -hgsids    = don't make track matrix, instead print a tab separated 
                file of hgsids:hgsidUseCount and track usages, for example: 
                hgsid:hgsidUseCount trackCount1 trackCount2 ... 
   -verbose=X = logging level, log messages print to stderr
...

In order to run the program, you must first put together a collection of the error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh. First you will want to make a directory on hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The 'error' argument states that we want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).
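A quick sanity check (just a sketch) that the links were created and resolve:

ls /hive/users/lrnassar/ErrorLogs | wc -l      # expect ~200 for a full year
ls -L /hive/users/lrnassar/ErrorLogs > /dev/null && echo "all symlinks resolve"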

Next you will want to run cooccurrenceCounts. It is recommended that you make a new directory for the output; in this case I chose /hive/users/lrnassar/MatrixData.

$ cd /hive/users/lrnassar/MatrixData
$ cooccurrenceCounts -db=hg19 -hc=400 -tc=100 ../ErrorLogs/ 2018UsageStats.txt -verbose=2

The -hc flag stipulates how many times an hgsid needs to be present for its data not to be filtered out. In this case I chose a very high number to try to filter out bots, as I was using an entire year (2018) of data; note that the cutoffs in the example usage statements are not as high. The -tc flag filters out tracks that do not occur very often; likewise, the cutoff chosen here is higher than in the examples. For reference, the above command took about 15 minutes to run.
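Before going further, you can sanity-check how much data survived the cutoffs; the first line of the output matrix holds the track names (a fact used again below), so counting its columns gives roughly the number of tracks kept:

head -1 2018UsageStats.txt | tr '\t' '\n' | wc -l    # approx. number of tracks that met the cutoffs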

At this point, you should have a list of tracks you want to get stats on (for an example of just general usage, see the bottom of this section). In my example, we had created a session based around 'clinical' tracks, and we wanted to see if users often turn on other tracks in combination with them. My track list is as follows (found in /hive/users/lrnassar/MatrixData/clinicalThemeTrackNames.txt):

  • knownGene
  • ncbiRefSeqCurated
  • pubsBlat
  • pubsMarkerSnp
  • iscaPathGainCum
  • iscaPathLossCum
  • iscaPathogenic
  • clinvarMain
  • clinvarCnv
  • omimAvSnp
  • spMut
  • snp150Common
  • snp150Flagged
  • tgpPhase3
  • evsEsp6500
  • exacVariants
  • decipher
  • decipherSnvs
  • hgmd
  • lovd

Note: You have to input the track name, not the long/short label

You can now create a list of all the tracks in your usage stats to make sure that your tracks made the cutoff. For example, if you used a very high -tc cutoff for cooccurrenceCounts, many tracks could have been filtered out. You will get an error if you try to run the python filter script on a track that doesn't exist in your matrix.

wc -l clinicalThemeTrackNames.txt
20
head -1 2018UsageStats.txt > 2018TrackList.txt
grep -owf clinicalThemeTrackNames.txt 2018TrackList.txt | wc -l
20

Here we can see that 20/20 tracks are in the matrix. The next step is optional, but in my case I wanted to remove all default tracks from my track list. Default tracks tend to be associated with tons of other tracks (especially other default tracks) as they are on... by default. You can create your own list; the list we put together for hg19 is here (/hive/users/lrnassar/MatrixData/hg19Defaults.txt).

grep -vwf hg19Defaults.txt clinicalThemeTrackNames.txt > clinicalThemeTrackNamesNoDefaults.txt
wc -l clinicalThemeTrackNamesNoDefaults.txt
16 clinicalThemeTrackNamesNoDefaults.txt

Here we see that 4 default tracks were filtered out of the original 20.

filterTop.py

Next we want to create a submatrix for each of our chosen tracks. This is done with filterTop.py. A usage statement for the script can be seen with the --help flag:

$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py --help

For our purposes, we will want to run filterTop.py on each of our tracks. We can loop the script as follows:

for track in $(cat clinicalThemeTrackNamesNoDefaults.txt); do
        # don't include the default tracks
        $HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py \
        -n 50 -t $track --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt \
        /hive/users/lrnassar/MatrixData/2018UsageStatsNoPrefix.txt $track.top50.nodefaults
done

Keep in mind the --default-track-list flag is optional; if you use it, you follow it with a file of all default tracks (or any tracks you want excluded from the output). In my case, this outputs 16 submatrices, one per input track, each holding the 50 tracks most used in conjunction with that track. These matrices can be freely explored, but for my purposes I wanted a more digestible output. So instead I pulled out the header line, which holds the 50 tracks most often seen with each of the input tracks, changed all the tabs into newlines, and appended them all to a single file:

for each in $(ls *.nodefaults); do
        head -n 1 $each | sed 's/\t/\n/g' >> usageResults.txt
done

Now we have a file (usageResults.txt) which holds a list of all the tracks in the top 50 for each of our input tracks. Next, we sort and count the unique entries:

cat usageResults.txt | sort | uniq -c | sort -r | head -n 30 > usageResults.final.txt
cat usageResults.final.txt | head -n 10 
     16 dgvPlus
     16 dgvMerged
     15 omimLocation
     15 decipher
     15 cnvDevDelay
     15 clinvarMain
     15 clinvarCnv
     15 clinvar
     14 omimAvSnp
     13 genomicSuperDup

Here we have the total results, with the top count being 16 (the original number of input tracks). In this case, dgvMerged and dgvPlus are both in the top 50 usage for all 16 input tracks, meaning they are very often used together with those tracks. This was useful for me to see which tracks users like to see together, but it has other applications as well.

General Stats Example

If you just want to generate a usage matrix of, say, the top 100 tracks used overall, you can use the same python script without designating a specific track to filter by.

#See the top 100 tracks that are used from a single error log:
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 2018UsageStatsNoPrefix.txt top100Tracks2018.txt
#or with no defaults
$HOME/kent/src/hg/logCrawl/dbTrackAndSearchUsage/coocurrenceCounts/filterTop.py -n 100 --default-track-list /hive/users/lrnassar/MatrixData/hg19Defaults.txt 2018UsageStatsNoPrefix.txt top100Tracks2018NoDefaults.txt

Using generateUsageStats.py

This method was put together by Matthew Speir. It generates usage statistics for dbs, tracks, and hub tracks using Apache error_log files. This script only counts each hgsid a single time (as opposed to the previous method); that is to say, heavy browser users are only represented once, as long as their hgsid remains the same.

This program can be found in the following directory:

kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py

The program can be run with no arguments to show a usage statement that includes examples of how to run it:

$ ./generateUsageStats.py 
usage: generateUsageStats.py [-h] [-f FILENAME] [-d DIRNAME] [-p] [-m] [-j]
                             [-t] [-o OUTDIR]

Generates usage statistics for dbs, tracks, and hubs tracks using Apache error_log files

optional arguments:
  -h, --help            show this help message and exit
  -f FILENAME, --fileName FILENAME
                        input file name, must be space-separated Apache
                        error_log file
  -d DIRNAME, --dirName DIRNAME
                        input directory name, files must be space-separated
                        error_log files. No other files should be present in
                        this directory.
  -p, --perMonth        output file containing info on db/track/hub track use
                        per month
  -m, --monthYear       output file containing month/year pairs (e.g. "Mar
                        2017")
  -j, --jsonOut         output json files for summary dictionaries
  -t, --outputDefaults  output file containing info on default track usage for
                        top 15 most used assemblies
  -o OUTDIR, --outDir OUTDIR
                        directory in which to place output files

In order to run the program, you must first put together a collection of the error logs you are interested in extracting from. The best way to do this is to use makeLogSymLinks.sh, just as for cooccurrenceCounts above. First you will want to make a directory on hive:

$ mkdir /hive/users/lrnassar/ErrorLogs

Next you will run makeLogSymLinks.sh to populate the directory with symlinks to all the error logs of interest (it can also be run with no arguments for a usage statement). I recommend running it from within your new directory:

$ cd /hive/users/lrnassar/ErrorLogs
$ makeLogSymLinks.sh 20180107 12 error

This will create symlinks to all the error logs starting with the 7th of January (the date has to match a file in /hive/data/inside/wwwstats/RR) and going forward 12 months from there. The 'error' argument states that we want the error logs, which is what we extract the data from. This particular command created symlinks to ~200 files (all of 2018).

Next you can call the python script to generate the usage stats. I ran the program in its own directory using the following command:

$ pwd
/hive/users/lrnassar/UsageStatsPython
~/kent/src/hg/logCrawl/dbTrackAndSearchUsage/generateUsageStats.py -d ../ErrorLogs/ -o . -pm

The -d flag directs the script to the directory with the error logs, -o designates the output directory, -p tells it to include 'perMonth' numbers (as opposed to just year), and -m tells it to include month/year pairs in the output.

For all of the 2018 error logs, generateUsageStats.py took about 3.5 hours to run. With the flags chosen, it created the following files (some with pretty large sizes):

-rw-rw-r-- 1 lrnassar protein  102K Apr 16 13:00 dbCounts.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein   24K Apr 16 12:57 dbCounts.tsv
-rw-rw-r-- 1 lrnassar protein  153M Apr 16 13:00 dbUsers.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  127M Apr 16 12:57 dbUsers.tsv
-rw-rw-r-- 1 lrnassar protein   117 Apr 16 13:00 monthYear.tsv
-rw-rw-r-- 1 lrnassar protein  172M Apr 16 13:04 trackCounts.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein   68M Apr 16 13:00 trackCounts.tsv
-rw-rw-r-- 1 lrnassar protein   27M Apr 16 13:04 trackCountsHubs.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  4.2M Apr 16 13:00 trackCountsHubs.tsv
-rw-rw-r-- 1 lrnassar protein  5.8G Apr 16 13:04 trackUsers.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  5.1G Apr 16 13:00 trackUsers.tsv
-rw-rw-r-- 1 lrnassar protein  285M Apr 16 13:04 trackUsersHubs.perMonth.tsv
-rw-rw-r-- 1 lrnassar protein  258M Apr 16 13:00 trackUsersHubs.tsv

This output can be simplified to six primary outputs:

dbCounts.tsv
dbUsers.tsv
trackCounts.tsv
trackCountsHubs.tsv
trackUsers.tsv
trackUsersHubs.tsv

dbCounts.tsv

dbCounts.tsv shows which databases came up for each hgsid counted. This provides a general outline of UCSC GB assembly usage:

$ sort dbCounts.tsv -rnk2 | head
hg19	1327811
hg38	854810
mm10	248086
mm9	122045
hg18	85716
dm2	28618
dm3	24265
dm6	19343
rn6	14095
danRer10	13565
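To put these counts in context, a small sketch that totals the hgsid counts across all assemblies:

awk '{sum += $2} END {print sum}' dbCounts.tsv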

dbUsers.tsv

dbUsers.tsv tracks the number of times each hgsid refreshed the page or the hgTracks image (e.g. zoom, change location), and which database was being used. It provides information on how much browsing users do on a particular hgsid before changing it.

$ sort dbUsers.tsv -rnk3 | head
hg19	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	1178311
hg19	667312737_X1OhY2Avi1bFF7wHO0jBQxicN2l6	580230
mm9	704619695_YvYKXleywtyEbFeZ3t2ElA1V1pGZ	321340
hg16	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210375
hg19	684548115_vbzUPQkjlAYIAcGN3pay9bw26zFV	164179
hg19	695129801_OltCuOauqya0ZBR6BUSIYgnePjhU	139952
hg19	686146493_vd8DvaXTRAcDO2lY2kg9oY5Ykygp	109248
hg19	676410973_X1sSQVduVqdaQaIJf8VybkrgHBKT	103357
hg19	679779053_6mMwink3vkv1ABHsRknc3GPzgAMw	103320
hg38	686731999_1YAq7ObJaY8W8hFzJuBhaxfJbWAm	93817

trackCounts.tsv

trackCounts.tsv shows the number of times a track was observed and which database it was seen with:

$ sort trackCounts.tsv -rnk3 | head
hg19	knownGene	1022394
hg19	cytoBandIdeo	1001358
hg19	cons100way	790623
hg38	knownGene	775122
hg19	pubs	756332
hg38	cytoBandIdeo	756204
hg38	refSeqComposite	754162
hg19	rmsk	749855
hg19	refSeqComposite	743841
hg38	ncbiRefSeqCurated	731005
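If you want the same ranking restricted to a single assembly, a small sketch (hg38 is only an example):

awk '$1 == "hg38"' trackCounts.tsv | sort -rnk3 | head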

trackCountsHubs.tsv

trackCountsHubs.tsv shows the number of times a *public hub* track was seen (note that individually added hubs are not counted, only UCSC public hubs), the name of the hub, the track within the hub, and the database it belongs to. This may be useful to see which public hubs are popular and which ones do not get much use:

$ sort trackCountsHubs.tsv -rnk4 -t $'\t' | head
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapConsolidatedH	8526
FANTOM5	hg19	FANTOM_CAT_lv4_stringe	7743
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapUniS	7440
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapBIHistoneModificati	7187
Roadmap Epigenomics Integrative Analysis Hub	hg19	RoadmapConsolidatedAssaya220	6865
Roadmap Epigenomics Integrative Analysis Hub	hg19	RoadmapUnconsolidatedAssaya230	6688
Roadmap Epigenomics Integrative Analysis Hub	hg19	RoadmapUnconsolidatedAssaya220	6686
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapUCSFHistoneModificati	5901
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapMeth	5659
Roadmap Epigenomics Data Complete Collection at Wash U VizHub	hg19	RoadmapDNa	5170

trackUsers.tsv

trackUsers.tsv is similar to dbUsers.tsv, except it also tracks what tracks were seen by each hgsid and database.

$ sort trackUsers.tsv -rnk4 -t $'\t' | head
hg16	refGene	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210375
hg16	stsMap	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	snp	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	rmsk	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	mzPt1Mm3Rn3Gg2_pHMM_wig	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	mzPt1Mm3Rn3Gg2_pHMM	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	mrna	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	intronEst	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	hg17Kg	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373
hg16	gap	684504471_EGbSkkh340PUaEz2aELLz4AwqslI	210373

trackUsersHubs.tsv

trackUsersHubs.tsv is similar to trackCountsHubs.tsv, however it includes hgsids. It shows which hub tracks were used by each hgsid, how many times hgTracks or its image was refreshed, and which database was being used.

$ sort trackUsersHubs.tsv -rnk5 -t $'\t' | head
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k4me3StdAlnsign	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k4me3StdA	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k36me3StdAlnsign	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k36me3StdA	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k27me3StdAlnsign	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneK562H3k27me3StdA	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k4me3StdAlnsign	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k4me3StdA	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k36me3StdAlnsign	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365
ENCODE Analysis Hub	hg19	wgEncodeUwHistoneGm12878H3k36me3StdA	669716131_CIe2uSQnjEoXSejA3J1HbNz7jjsF	97365