Migration to hive
Revision as of 19:50, 15 September 2008
Our godlike cluster-admins have provided us with a giant and hopefully scalable virtual disk, /hive. We hope to use this superdisk for data build directories and cluster run i/o, replacing /cluster/store* as well as bluearc and san. We expect this to greatly simplify our build processes: all files will be in a logical place under /hive. No more hunting around /cluster/* and following symlinks to the actual storage! No more genome build stuff stored on /san and bluearc for lack of space! No more rsyncing files to and from san and bluearc for cluster runs!
However, in order to enjoy those benefits, we have some work to do:
- Move everything of value from /cluster/store*, san and bluearc to its new logical place under /hive.
- Update symlinks in /gbdb, htdocs/goldenPath and /cluster/data to point to the new /hive locations.
- Update all script references to the old fileservers.
- When using old doc/*.txt templates, replace old file paths with new ones. Also, don't stage files on other disks for cluster runs (except on the cluster nodes' local /scratch disk); just use /hive.
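One way to find script references that still need updating is a recursive grep for the old mount points. This is only a sketch: find_old_refs is a hypothetical helper name, and the pattern assumes the old paths of interest are /cluster/store*, /cluster/bluearc and /san/sanvol1.

```shell
#!/bin/sh
# Sketch: list files under the given directories that still mention the old
# fileserver paths. find_old_refs is a hypothetical helper name.
find_old_refs() {
    # The pattern covers /cluster/storeN, /cluster/bluearc and /san/sanvol1;
    # extend it if other old mount points are in use.
    grep -rlE '/cluster/(store[0-9]+|bluearc)|/san/sanvol1' "$@"
}
```

For example, `find_old_refs path/to/scripts` prints one file name per line for each file that still needs editing, and returns non-zero when nothing matches.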
When the migration to /hive is complete, /cluster/store* will disappear, having been completely subsumed by /hive. /cluster/data will stick around a bit longer, but ultimately all uses of /cluster/data will be replaced by their corresponding /hive paths, and /cluster/data will be retired as well.
Old stuff: on /hive/archive (for now)
Currently, /cluster/storeN is an NFS-mounted version of /hive/archive/storeN. /san/sanvol1 has been backed up as /hive/archive/SanVol1. /cluster/bluearc is gone; /hive/archive/bluearc is a fairly recent backup.
These archives will not last forever -- we have a limited time to move anything we want to keep out of /hive/archive/* and into one of the new /hive/ paths. (The deadline is still being negotiated with cluster-admin.) After the cutoff date(s), /hive/archive/* will disappear: the contents will be archived to tape and the disk space freed up.
GPFS vs. NFS, or how to move directories
For years we have been mindful of the difference between local disk (fast) and NFS (slow but bigger). Now we have a new filesystem to consider: /hive uses GPFS, while all /cluster/* are still NFS-mounted. For mv to act as a rename, as we expect, the OS must see the source and destination as being on the same filesystem. Although the old /cluster/store* contents have been physically moved to /hive/archive/store*, the OS does not recognize /cluster/store* and /hive as the same filesystem, so mv between them does not behave as we'd like. When you move data out of /hive/archive/* and into /hive/.../, use /hive paths for both the source and the destination, run the command on a machine with native GPFS support for hive (like hgwdev or swarm), and do it while nobody is modifying anything in the directory:
 % df -h /hive
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/hivedev          160T   59T  102T  37% /hive
 % mv /hive/archive/store6/myProject /hive/users/me/
That is just a rename operation, and it's practically instantaneous. (Please do not mv from /cluster to /hive -- this results in a copy instead of a move; it is slow, all file ownership information is lost, and you might not have permissions to move some of the files, resulting in errors and incomplete results.)
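To guard against an accidental cross-filesystem mv, a wrapper can compare the device IDs of the source and destination before moving. This is a minimal sketch, not an established tool: fs_id, same_fs and safe_mv are hypothetical names, and it assumes GNU coreutils stat (`-c %d`). Matching device IDs are a necessary condition for a rename, not a guarantee in every mount configuration.

```shell
#!/bin/sh
# Sketch: refuse to mv unless source and destination directory report the
# same filesystem. fs_id, same_fs and safe_mv are hypothetical helper names.

# Print the device ID of the filesystem holding path $1.
# Assumes GNU coreutils stat; BSD stat would need -f %d instead.
fs_id() {
    stat -c %d "$1"
}

# Succeed only if both paths exist and report the same device ID.
same_fs() {
    a=$(fs_id "$1") || return 1
    b=$(fs_id "$2") || return 1
    [ "$a" = "$b" ]
}

# Move $1 into directory $2, but only when mv can be a plain rename.
safe_mv() {
    src=$1
    destdir=$2
    if same_fs "$src" "$destdir"; then
        mv "$src" "$destdir"/
    else
        echo "refusing: $src and $destdir are not on the same filesystem" >&2
        return 1
    fi
}
```

With these definitions, `safe_mv /hive/archive/store6/myProject /hive/users/me` performs the rename shown above, while a /cluster source path would be refused instead of silently turning into a slow copy.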
New /hive paths
/hive has several subdirectories, to provide some organization for the new namespace:
- /hive/data, which in turn has a few subdirectories:
  - /hive/data/genomes: genome database build directories, e.g. /hive/data/genomes/sacCer1
  - /hive/data/outside: external database downloads, e.g. /hive/data/outside/ncbi
  - /hive/data/inside: internally-built non-genome databases, e.g. /hive/data/inside/visiGene
- /hive/users: personal projects, e.g. /hive/users/kent
- /hive/groups: group projects, e.g. /hive/groups/qa
How to move part 2: after the mv command
The mv from /hive/archive/... to /hive/... takes no time at all, but you may need to do some followup.
- If you move a genome database build directory $db, update its /cluster/data/$db symlink to point to /hive/data/genomes/$db. This applies to external-db directories too, e.g. /cluster/data/ncbi -> /hive/data/outside/ncbi.
- On hgwdev, look for any symlinks to the old location in /gbdb/$db and /usr/local/apache/htdocs/goldenPath/$db, and update if necessary:
find /gbdb/$db /usr/local/apache/htdocs/goldenPath/$db -type l -ls | grep /cluster/store
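Once the find command has turned up stale links, they can be repointed in bulk. The following is a sketch under stated assumptions: repoint_links is a hypothetical helper that rewrites any symlink whose target begins with an old prefix, and it assumes file names without embedded newlines. Review its output before trusting it on /gbdb.

```shell
#!/bin/sh
# Sketch: repoint symlinks whose target starts with an old prefix.
# repoint_links is a hypothetical helper; usage: repoint_links OLD NEW DIR...
repoint_links() {
    old=$1
    new=$2
    shift 2
    find "$@" -type l | while IFS= read -r link; do
        target=$(readlink "$link")
        case $target in
            "$old"*)
                # Swap the old prefix for the new one and keep the rest.
                newtarget=$new${target#"$old"}
                echo "$link: $target -> $newtarget"
                # -n replaces the link itself instead of creating a new
                # link inside the directory it currently points at.
                ln -sfn "$newtarget" "$link"
                ;;
        esac
    done
}
```

For the example move above, this might be called as `repoint_links /cluster/store6/myProject /hive/users/me/myProject /gbdb/$db`. The single /cluster/data/$db symlink can be updated the same way by hand with `ln -sfn /hive/data/genomes/$db /cluster/data/$db`.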
san and bluearc: what to move?
Short answer: anything that should have been in /cluster/store*, but wasn't because space was tight.
The bluearc is gone due to hardware failure, so any run results that we care about must be moved from /hive/archive/bluearc to /hive/{data,groups,users}.
The san is still up and running, but many of us got into the bad habit of storing large datasets there. cluster-admin threatens (rightly) to remove old files from it in order to free up space for the san's intended use: temporary disk for cluster runs. So, arduous as it may be, we really should dig through the old files on san and move genome db build pieces into /hive/data/genomes/$db/bed/, where they will be safe.
And the san will not live forever... but the plan is that the hive will. :)