Trash cleaners
Revision as of 18:11, 18 July 2013
Overview
The trash cleaning system at UCSC has evolved from a simple one-line cron job that removed older files from the /trash/ directory into a complex set of interlocking scripts. This discussion outlines the procedures and lock files that keep the system running safely.
recovery from problems
Login to rrnfs1 and check if there are any currently running processes:
ps -ef | grep -i qateam
It may be the case that a previous instance simply hasn't completed yet. Let it finish; you do not want to interrupt this system.
If there is nothing running, check the most recent log file to see if there is any message about the problem in:
/export/userdata/rrLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt
/export/userdata/betaLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt
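The dated path follows a YYYY/MM/...THH naming scheme, so the current hour's log can be located with date(1). A minimal sketch (the log root is the rrLog directory from above; substitute betaLog for the hgwbeta cleaner):

```shell
# Build the current hour's cleaner log path from the YYYY/MM scheme.
logRoot="/export/userdata/rrLog"    # or betaLog for the hgwbeta cleaner
stamp=$(date +"%Y/%m/cleanerLog.%Y-%m-%dT%H.txt")
echo "${logRoot}/${stamp}"
```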
Alternatively, the temporary files under construction in /var/tmp/ may contain the error message from a failed command. Typical file names you may find there:
-rw-rw-rw- 1 85743056 Jul 18 10:46 refreshList.O18591
-rw-rw-rw- 1 1663224 Jul 18 10:46 sessionFiles.g18585
-rw-rw-rw- 1 935 Jul 18 10:46 saveList.g18588
-rw-rw-rw- 1 1963973 Jul 18 10:46 alreadySaved.d18582
-rw-rw-rw- 1 31782116 Jul 18 11:00 trash.atime.S24127
-rw-rw-rw- 1 25326398 Jul 18 11:01 one.hour.S24127
-rw-rw-rw- 1 9604147 Jul 18 11:01 eight.hour.S24127
You will always find these two files here:
-rw-rw-rw- 1 5133861 Jul 18 11:01 rr.8hour.egrep
-rw-rw-rw- 1 650758 Jul 18 11:02 rr.72hour.egrep
They are left there for perusal; they are the listings of the files that were removed during this cycle of the system. If you only see these two files, the system should have completed successfully. When it fails, it will leave some of the other temporary files behind. These removed-file listings are also archived as logs in:
/export/userdata/rrLog/removed/YYYY/MM/
When any of these scripts encounters a problem and does not remove its lock file, the system remains off until the lock files can be manually removed. Email is sent to hiram when they are in this state as a reminder to check them. The log files should be examined to see if there is any real problem. The usual case is that some bottleneck was in place somewhere and the scripts merely ran into themselves after one of them failed. In this case, simply removing the lock files will allow the system to continue at the next cron job invocation:
/var/tmp/qaTeamTrashMonitor.pid
/export/userdata/cleaner.pid
Before removing those files, check the log files to see if there is any message about what went wrong.
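The manual check above can be sketched as a small script. The lock paths are the ones documented here; the sketch only reports stale locks rather than removing them, since the logs should be reviewed first:

```shell
# Recovery sketch: only treat the lock files as stale when no qateam
# cleaner process is still running (the [q] trick excludes this grep).
locks="/var/tmp/qaTeamTrashMonitor.pid /export/userdata/cleaner.pid"
if ps -ef | grep -i '[q]ateam' > /dev/null; then
    echo "cleaner still running -- leave the lock files alone"
else
    for f in $locks; do
        if [ -f "$f" ]; then
            echo "stale lock: $f -- remove it after reviewing the logs"
        fi
    done
fi
```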
Primary trash directory
The current trash directory NFS server is: rrnfs1
You can login to that machine via the qateam user.
A cron job running under the root user calls the scripts in the qateam directory. It currently runs once every 4 hours, at 00:10, 04:10, 08:10, 12:10, 16:10, and 20:10. The cluster admins maintain this root crontab entry; it is a single command:
/home/qateam/trashCleaners/hgwbeta/trashCleanMonitor.sh searchAndDestroy
This hgwbeta/trashCleanMonitor.sh script cleans trash files for hgwbeta custom tracks, and then calls the primary RR trashCleanMonitor.sh to do the big job of cleaning the RR custom tracks.
Cleaner lock file
The trashCleanMonitor.sh script uses a lock file to prevent it from overrunning an already running instance of these scripts. When this lock file exists, the system will not start a new instance of the cleaners; instead it sends email to hiram as an alert that the cleaners are overrunning themselves, which normally does not happen when everything is OK. If a previous instance failed, the lock file remains in place to keep the cleaners off until the error is recognized and taken care of. The complete cleaner system must finish successfully for the lock file to be removed.
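The guard can be sketched as follows. The lock path here is a throwaway stand-in under /tmp, not the production /export/userdata/cleaner.pid, and the production script emails hiram rather than printing:

```shell
# Lock-file guard sketch: refuse to start when a previous run's lock
# still exists; otherwise take the lock and release it only on success.
lockFile="${TMPDIR:-/tmp}/cleaner.demo.$$.pid"
if [ -f "$lockFile" ]; then
    echo "overrun: lock exists at $lockFile"   # production emails hiram here
    exit 1
fi
echo $$ > "$lockFile"
# ... the entire cleaning run would happen here ...
rm -f "$lockFile"    # reached only when the whole run succeeds
result="lock released"
```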
hgwbeta cleaner
This first script hgwbeta/trashCleanMonitor.sh has a simple job. It scans the namedSessionDb table in hgcentralbeta to take care of the trash files that belong to a saved session on hgwbeta. Trash files that are used from a saved session are moved out of the trash directory into
/export/userdata/ct/beta/
with a symlink left in the primary trash directory:
/export/trash/ct/someFile -> ../../userdata/ct/beta/someFile
The actual script that does this scanning, moving files, and symlinks is called from hgwbeta/trashCleanMonitor.sh in order to monitor the successful result of the called script:
/home/qateam/trashCleaners/hgwbeta/trashCleaner.csh
The trashCleanMonitor.sh script verifies that this script has completed successfully via not only its return code, but also the last line of the log file written by the script, which must read: SUCCESS. The log file written by this script can be found in:
/export/userdata/betaLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt
where YYYY is the year, MM the month, DD the date, HH the hour at the time the script runs.
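The double check (exit code plus final SUCCESS line) can be sketched like this, with a stand-in log file and exit code instead of a real trashCleaner.csh run:

```shell
# Verification sketch: a cleaner run passes only if its exit code is 0
# AND the last line of its log reads SUCCESS.
log="${TMPDIR:-/tmp}/cleanerLog.demo.$$.txt"
printf 'cleaning...\nSUCCESS\n' > "$log"   # stand-in for the real cleaner log
rc=0                                       # stand-in for trashCleaner.csh's exit code
if [ "$rc" -eq 0 ] && [ "$(tail -1 "$log")" = "SUCCESS" ]; then
    verdict="cleaner OK"
else
    verdict="cleaner FAILED"               # production emails hiram here
fi
echo "$verdict"
rm -f "$log"
```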
Upon successful completion of the hgwbeta/trashCleaner.csh script, the monitor script runs an exec command for the primary RR cleaning script:
exec /home/qateam/trashCleaners/rr/trashCleanMonitor.sh searchAndDestroy
the RR cleaner
The same monitor/called-script setup is used for the RR cleaner. The primary script:
/home/qateam/trashCleaners/rr/trashCleanMonitor.sh
requires the lock file initiated by the beta cleaner to exist. This script will not run if the lock file /export/userdata/cleaner.pid does not exist.
This called script:
/home/qateam/trashCleaners/rr/trashCleaner.csh
performs the job of scanning the namedSessionDb table in hgcentral for any sessions for the RR system to do the same move and symlink trick as mentioned above. The moved files end up in:
/export/userdata/ct/rr/
with symlinks from trash:
/export/trash/ct/someFile -> ../../userdata/ct/rr/someFile
That scanning for files also accesses the associated custom trash database tables for each track, which updates the last-accessed time in the custom trash database metaInfo table. This last access time is important for the removal of older custom trash database tables.
After those files are taken care of, the removal of trash database tables begins. This is done with the command /home/qateam/bin/x86_64/dbTrash with a 72 hour expiration time. Since the scanning of the named session touched the last access time of these database tables, they will survive this 72 hour expiration time.
After the custom trash database tables are cleaned, the removal of trash files begins. For performance purposes, the scanning of files and times in /export/trash/ needs to be done with a minimum of impact to the filesystem. There is a single find -type f command run on the /export/trash/ filesystem performed by a called script:
/home/qateam/cronScripts/trashMonV2.sh
That file list is used by a perl script to discover the last access times of the files in trash via a stat function in:
/home/qateam/dataAnalysis/betterTrashMonitor/fileStatsFromFind.pl
This method has been tested to show that it works very rapidly through very large file listings.
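The same idea can be sketched in shell; the production version uses a single find plus a perl stat loop, while this sketch uses GNU stat (whose %X format prints the access time in epoch seconds) against a throwaway directory instead of /export/trash/:

```shell
# Atime-scan sketch: one find pass for the file list, then stat for the
# access times, run against a temporary directory rather than real trash.
scanDir=$(mktemp -d)
touch "$scanDir/example.txt"
find "$scanDir" -type f | while read -r f; do
    stat -c '%X %s %n' "$f"    # atime (epoch seconds), size, path
done
rm -r "$scanDir"
```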
Those measuring scripts, as a side effect, maintain logs of data sizes for everything in trash. Those logs are accumulating in:
/home/qateam/trashLog/YYYY/MM/YYYY-MM-DD.HH:MM:SS
The result of the scanning scripts is a file listing with the last access time in seconds as temporary files in /var/tmp/
A simple awk of that last-access-time listing against the threshold expiration time produces a list of files to remove from the trash directory. Two different expiration times are in effect for different sections of the trash directory. Short-lived files that are one-time use only by the browser are removed with an hour of expiration time. Custom track trash files and other files associated with browser-generated data that can be used repeatedly by a user session expire after a 72 hour timeout.
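A sketch of that awk step, assuming a listing format of "access-time-in-seconds path" (the exact field layout and the two sample lines here are illustrative, not the real listing):

```shell
# Expiration filter sketch: print paths whose last access time is older
# than the 72 hour cutoff.
now=1374150000                      # a fixed "now" in epoch seconds
cutoff=$(( now - 72 * 3600 ))
expired=$(printf '%s\n' \
    '1373000000 /export/trash/ct/old.bed' \
    '1374140000 /export/trash/hgt/fresh.png' |
  awk -v cutoff="$cutoff" '$1 < cutoff { print $2 }')
echo "$expired"
```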
The RR trashCleaner.csh script accumulates log files into:
/export/userdata/rrLog/YYYY/MM/cleanerLog.YYYY-MM-DDTHH.txt
When this script completes successfully, it removes the lock file: /export/userdata/cleaner.pid
The caller trashCleanMonitor.sh verifies a successful return code from trashCleaner.csh and a SUCCESS message in the cleanerLog file. If anything fails, email is sent to hiram.
trash measurement
To keep track of use statistics on the trash filesystem, the script mentioned above:
/home/qateam/cronScripts/trashMonV2.sh
is used by the trash cleaners and is also run on its own to periodically measure the trash filesystem.
Since the trash cleaners are only running once every 4 hours, this measurement script is run during hours when the cleaners are not running. It is on the crontab of the qateam user on rrnfs1:
42 1,2,5,6,9,10,13,14,17,18,21,22 * * * nice -n 19 ~/cronScripts/measureTrash.sh
This measureTrash.sh script calls /home/qateam/cronScripts/trashMonV2.sh and removes the temporary access time file created in /var/tmp/.
It also honors the lock file used by the trash cleaners, which prevents it from overrunning their use of the measurement system: /export/userdata/cleaner.pid
The script trashMonV2.sh also has a lock file to prevent it from overrunning itself:
/var/tmp/qaTeamTrashMonitor.pid