Running your own gfServer

From genomewiki
Latest revision as of 20:12, 27 June 2024

BLAT servers (gfServer) are configured as either dedicated or dynamic servers. Dedicated BLAT servers index a genome when started and remain running in memory to respond quickly to requests. Dynamic BLAT servers pre-index genomes to files, are run on demand to handle a BLAT request, and then exit.

Dedicated gfServers are easier to configure and faster to respond. However, each server continually uses memory. A dynamic gfServer is more appropriate with multiple assemblies and infrequent use. Its response time is usually acceptable; however, it varies with the speed of the disk containing the index. With repeated access, the operating system will cache the indexes in memory, improving response time.

Both database-based assemblies and assembly hubs may be configured to use either type of BLAT server.


Configuring a dedicated gfServer

  • If you want to run your own BLAT server you need a lot of spare memory on the machine. You may also want to review our mailing list archives for gfServer troubleshooting advice.
  • Run two instances of gfServer from http://yourServer.yourInstitution.edu in the directory containing the yourAssembly.2bit file, one for amino acid searches (-trans option) and one for DNA searches, each on its own port. Please note that the -mask option ignores all lower-case assembly sequence, which is the convention the UCSC Browser uses for masked sequence, so you may want to omit it from the example below.
 * When picking a port number, stick with numbers between 1024 and 49151. Anything less than 1024 is considered a system port and you'll need to be root in order to open it. Anything above 49151 is considered dynamic and randomly assigned. 
  • For example, these two lines specify port 17777 for DNA searches and 17779 for amino acid searches, and are run from the publicly accessible directory containing the yourAssembly.2bit file:
  cd /genomes/yourAssembly
  gfServer start blatMachine 17777 -stepSize=5 -log=yourAssembly.untrans.log yourAssembly.2bit &
  gfServer start blatMachine 17779 -trans -mask -log=yourAssembly.trans.log yourAssembly.2bit &

Adding something like this to a startup file of your server, e.g. /etc/rc.d/rc.local, will ensure they are started when your system is rebooted.
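
The port advice above can be checked before the servers are started. The following is a minimal sketch, assuming a POSIX-style shell; is_registered_port is a hypothetical helper name and is not part of gfServer.

```shell
# Hypothetical helper (not part of gfServer): succeed only when the argument
# is a whole number in the range 1024-49151 recommended above.
is_registered_port() {
    local port="$1"
    # Reject empty strings and anything containing a non-digit.
    case "$port" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$port" -ge 1024 ] && [ "$port" -le 49151 ]
}

# Check the two example ports used on this page.
for port in 17777 17779; do
    if is_registered_port "$port"; then
        echo "port $port is in the registered range"
    else
        echo "port $port is outside 1024-49151" >&2
    fi
done
```

A check like this could sit at the top of the startup file so that a mistyped port is reported before gfServer is launched.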

Configuring database genomes to use a dedicated gfServer

  • Tell the browser where to find the 2bit file with the SQL command:
  update hgcentral.dbDb set nibPath = "/genomes/yourAssembly" where name="yourAssembly";
  • On RedHat you might need SELinux permissions:
  sudo chcon --type=httpd_sys_content_t /gbdb/yourAssembly/yourAssembly.2bit
  • Add the server to the hgcentral database with the SQL commands:
  update hgcentral.blatServers set host = "localhost", port=17777 where db="yourAssembly" and isTrans=0;
  update hgcentral.blatServers set host = "localhost", port=17779 where db="yourAssembly" and isTrans=1;
  • If you're not running a protein server, remove its entry from hgcentral with the SQL command:
  delete from hgcentral.blatServers where db="yourAssembly" and isTrans=1;
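
When several assemblies are configured, the two update statements above can be generated rather than typed by hand. This is a rough sketch only; blat_server_sql is a hypothetical helper name, and the statements it prints are the same ones shown above with the database name, host, and ports filled in.

```shell
# Hypothetical helper: print the hgcentral.blatServers updates for one
# assembly. Arguments: database name, host, DNA port, translated port.
blat_server_sql() {
    local db="$1" host="$2" dna_port="$3" trans_port="$4"
    cat <<EOF
update hgcentral.blatServers set host = "$host", port=$dna_port where db="$db" and isTrans=0;
update hgcentral.blatServers set host = "$host", port=$trans_port where db="$db" and isTrans=1;
EOF
}

# Print the statements for the example assembly used on this page.
blat_server_sql yourAssembly localhost 17777 17779
```

The output is plain SQL, so it can be reviewed first and then passed to whichever MySQL client you normally use for hgcentral.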

Configuring a dynamic gfServer

A dynamic BLAT server consists of gfServer being started on demand to handle a single user request. It uses a pre-built index from disk for the request. A single server on one configured port handles multiple genomes as well as nucleotide, protein-translated, and protein queries. Genome indexes must be pre-built, with all of them installed or linked under a common directory hierarchy called the gfServer root directory.

The dynamic gfServer is started by xinetd or systemd, depending on your UNIX / Linux distribution.

Configuring xinetd

The xinetd server, or the older inetd server, is a standard package on UNIX / Linux systems. It is a facility that runs a program to handle an internet server request. A system administrator generally configures it. The server runs the services as an unprivileged user. Please see your operating system documentation for more details.

An example configuration file is shown below. It launches gfServer with two arguments, the literal string "dynserver" and the gfServer root directory path.

service blat
{
         port            = 5010
         socket_type     = stream
         wait            = no
         user            = blatuser
         group           = genecats
         server          = /mnt/data/dyn-blat/bin/gfServer
         server_args     = dynserver /mnt/data/dyn-blat/genomes
         type            = UNLISTED
         log_on_success  += USERID EXIT
         log_on_failure  += USERID
         disable         = no
}

Configuring systemd

Configure logging in /etc/rsyslog.d/listen.conf

$SystemLogSocketName /run/systemd/journal/syslog
local0.*             /var/log/dynGfServer

Then restart rsyslogd:

% systemctl restart rsyslog

Create /etc/systemd/system/blat.socket

[Unit]
Description=gfServer Activation Socket
ConditionPathExists=/scratch/hubs

[Socket]
ListenStream=0.0.0.0:4040
MaxConnections=50
Accept=yes

[Install]
WantedBy=sockets.target
WantedBy=multi-user.target

Create /etc/systemd/system/blat@.service

[Unit]
Description=gfServer Server
Requires=blat.socket

[Service]
ExecStart=/scratch/gfServer -syslog -logFacility=local0 dynserver /scratch/hubs
StandardInput=socket
User=blatuser
Group=genecats

Reload the systemd configuration:

% systemctl daemon-reload

Activate the blat socket:

% systemctl enable blat.socket
% systemctl start blat.socket

Now you can view the new socket's status:

% systemctl status blat.socket
● blat.socket - gfServer Activation Socket
     Loaded: loaded (/etc/systemd/system/blat.socket; enabled; preset: disabled)
     Active: active (listening) since Sat 2023-09-09 19:25:22 PDT; 30min ago
      Until: Sat 2023-09-09 19:25:22 PDT; 30min ago
   Triggers: ● blat@67-128.114.119.165:4040-128.114.119.131:35906.service
             ● blat@71-128.114.119.165:4040-198.199.102.83:37248.service
             ● blat@68-128.114.119.165:4040-128.114.119.131:35990.service
     Listen: 0.0.0.0:4040 (Stream)
   Accepted: 88; Connected: 0;
      Tasks: 0 (limit: 3301797)
     Memory: 8.0K
        CPU: 1ms
     CGroup: /system.slice/blat.socket

Sep 09 19:25:22 dynablat-01.soe.ucsc.edu systemd[1]: Listening on gfServer Activation Socket.

An 'lsof -Pi' will show the socket listening on port 4040:

% lsof -Pi | grep 4040
systemd       1   root   40u  IPv4  24469      0t0  TCP *:4040 (LISTEN)
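
The port in blat.socket has to match the port given to the browser for the dynamic server, so it can be worth reading it back from the unit file. The sketch below assumes the ListenStream= line has the address:port form used in the example above; socket_unit_port is a hypothetical helper name, and the temporary file stands in for /etc/systemd/system/blat.socket.

```shell
# Hypothetical helper: print the port from the ListenStream= line of a
# systemd socket unit file. Assumes the address:port form; a bare port
# number without a colon is not handled.
socket_unit_port() {
    sed -n 's/^ListenStream=.*:\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1"
}

# Demonstration on a temporary copy of the [Socket] section shown above.
demo_unit="$(mktemp)"
cat > "$demo_unit" <<EOF
[Socket]
ListenStream=0.0.0.0:4040
MaxConnections=50
Accept=yes
EOF
socket_unit_port "$demo_unit"
rm -f "$demo_unit"
```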


Building gfServer indexes

Three files are required by dynamic gfServers and must follow the naming convention:

  • yourAssembly.2bit - two-bit format genomic sequence
  • yourAssembly.untrans.gfidx - untranslated index
  • yourAssembly.trans.gfidx - translated index

Where yourAssembly is the database or hub name of the assembly. For database-based assemblies, the files are stored in a directory with the same name as the assembly database, such as rootdir/yourAssembly/. For assembly hubs, they may follow this convention or use more deeply nested directories such as rootdir/GCF/000/181/335/GCF_000181335.3/.

The gfServer parameters are stored with the index and are specified when the index is created. The following commands will build the indexes:

  gfServer index -stepSize=5 yourAssembly.untrans.gfidx yourAssembly.2bit
  gfServer index -trans yourAssembly.trans.gfidx yourAssembly.2bit
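
Before pointing the browser at a dynamic server, it may help to confirm that all three files exist for an assembly. The following is a minimal sketch, assuming the files are named after the directory that contains them, as in the convention described above; missing_gf_files is a hypothetical helper name, and the temporary directory stands in for a real gfServer root directory.

```shell
# Hypothetical helper: print the name of each expected file that is missing
# from one assembly directory. Assumes the files share the directory's name.
missing_gf_files() {
    local dir="$1" name suffix
    name="$(basename "$dir")"
    for suffix in 2bit untrans.gfidx trans.gfidx; do
        [ -f "$dir/$name.$suffix" ] || echo "$name.$suffix"
    done
}

# Demonstration: only the 2bit file exists, so both indexes are reported.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/yourAssembly"
touch "$demo_root/yourAssembly/yourAssembly.2bit"
missing_gf_files "$demo_root/yourAssembly"
rm -rf "$demo_root"
```

Any index reported as missing can then be built with the matching gfServer index command shown above.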


Configuring database genomes to use a dynamic gfServer

Existing mirrors will need to add a column "dynamic" to hgcentral.blatServers with the following SQL command:

  alter table hgcentral.blatServers add column dynamic tinyint not null default 0;

To change an existing genome to use the dynamic gfServer, use the SQL commands:

  update hgcentral.blatServers SET host = "localhost", port=5010, dynamic=1 where db="yourAssembly" and isTrans=0;
  update hgcentral.blatServers SET host = "localhost", port=5010, dynamic=1 where db="yourAssembly" and isTrans=1;


RAM requirements for BLAT servers

The gfServers that answer BLAT queries can require a substantial amount of memory. The following figures may help in estimating the amount required for genomes of different sizes.

The human hg19 genome requires ~2.2GB for the translated amino acid gfServer queries
and ~2.2GB for the untranslated DNA gfServer queries representing ~3,137,161,000 bp.
The zebrafish danRer7 genome requires ~1.2GB for the translated amino acid gfServer queries
and ~1.1GB for the untranslated DNA gfServer queries representing ~1,412,465,000 bp.
The D. melanogaster dm6 genome requires ~300MB for the translated amino acid gfServer queries
and ~250MB for the untranslated DNA gfServer queries representing ~143,726,000 bp.
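
To compare the figures above across genomes, they can be expressed as memory per base. This is a rough sketch only, assuming 1 MB = 1,000,000 bytes and using the approximate numbers listed above; bytes_per_base is a hypothetical helper name.

```shell
# Hypothetical helper: bytes of gfServer memory per genome base.
# Arguments: memory in MB (assumed to be 1,000,000 bytes), genome size in bp.
bytes_per_base() {
    awk -v mb="$1" -v bp="$2" 'BEGIN { printf "%.2f\n", (mb * 1000000) / bp }'
}

bytes_per_base 2200 3137161000   # hg19, untranslated: prints 0.70
bytes_per_base 1100 1412465000   # danRer7, untranslated: prints 0.78
bytes_per_base 250 143726000     # dm6, untranslated: prints 1.74
```

On these figures the per-base cost is higher for the small dm6 genome than for hg19 or danRer7, so scaling linearly from a large genome would likely underestimate the memory needed for a small one.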