Getting started with Cell Browser wrangling

From Genecats

UCSC VPN

Connecting to the UCSC VPN is not required for much of your day-to-day work. However, you may need it to access papers in popular journals (e.g. Science or Nature) that aren’t open access.

To install the campus VPN, see the Information Technology Services (ITS) UCSC VPN Installation Instructions. Since our work computers are not managed by ITS, you can simply download the installer that matches your operating system. Once it is installed, open it and follow the "Step 2: Establish a VPN Connection" instructions.

Useful Bookmarks

Below is a non-exhaustive list of sites that you should bookmark:

If you use your web browser’s bookmark bar, it can be helpful to use one to two character labels for your bookmarks as it allows you to squeeze more into that bar. For example, the cells-test bookmark could be labeled ‘T’, cells-beta as ‘B’, and so on.

Accounts you should have

To wrangle for the Cell Browser, you should have the following accounts:

  • Github - you should be given ‘Write’ access to the cellbrowser-confs repo, and write access to the
  • Redmine - you should be given access to the 'Cells' project, use this for tracking bugs, features, releases, etc.

Directories you should have

There are certain directories that every Cell Browser wrangler should have set up:

  • /hive/users/${hgwdev_username} - the /hive filesystem is where any operations done with large files should be done.
    • Within this directory you should have:
      • cb/ - a place where you can explore experimental cell browser datasets
      • tmp/ or temp/ - a place where you can do temp file operations or things that you know you will most likely delete later
  • /cluster/home/${hgwdev_username}
    • Within this directory you should have:
      • bin/ - where you can put scripts and other utilities so that they are picked up by your PATH
      • public_html/ - you can put files here (including cell browsers) so that they are accessible via the web at https://hgwdev.gi.ucsc.edu/~${hgwdev_username}
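The layout above can be created in one pass. What follows is only a minimal sketch, not an official setup script: the function name make_wrangler_dirs and its two arguments are hypothetical, it picks tmp/ rather than temp/, and the commented example assumes that $USER matches your hgwdev username.

```shell
# Hypothetical helper: create the wrangler directory layout described above.
# $1 = your /hive/users/<username> directory
# $2 = your /cluster/home/<username> directory
make_wrangler_dirs() {
    hive_dir=$1
    home_dir=$2
    # mkdir -p creates missing parents and does nothing for directories that already exist
    mkdir -p "$hive_dir/cb" "$hive_dir/tmp" \
             "$home_dir/bin" "$home_dir/public_html"
}

# Example call on hgwdev (commented out so that sourcing this file changes nothing):
# make_wrangler_dirs "/hive/users/$USER" "/cluster/home/$USER"
```

Because mkdir -p is safe to repeat, running the helper a second time leaves existing directories and their contents untouched.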

Helpful papers to read

Here is a brief ‘App Note’ about the UCSC Cell Browser and an overview of its basic features:

This paper provides a great introduction to the process of single-cell analysis and should give you an idea as to how the UCSC Cell Browser fits into that process:

Conda

Importing data into the Cell Browser depends on two primary external tools: Seurat and Scanpy. The easiest way to install and manage these tools is with conda. Conda is a package management tool similar to pip, but it is not limited to Python packages: you can also use it to install packages for R and more.

Step 1: Install miniconda

After you have logged into hgwdev, download the conda setup script to your home directory:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Then run the setup script:

bash Miniconda3-latest-Linux-x86_64.sh

Enter ‘y’ when prompted throughout the installation process. Full installation instructions can be found on their website.

At the end, run this to immediately activate conda (otherwise, it will automatically be activated next time you log out/in to hgwdev):

source ~/.bashrc

Step 2: Set up conda envs for Seurat & Scanpy

We will set up separate environments for Scanpy and Seurat. Conda provides good documentation if you want to learn more about managing your conda environments.

2a Scanpy Environment

Creating the environment

First, we’ll set up an environment in which you will install scanpy:

conda create --name scanpyenv python=3.9

It will do some work to find the packages that it needs to install. When prompted, type “y” to finish the setup and install everything.

After creating the env, activate it so that the packages you install next will be installed in that environment:

conda activate scanpyenv

Mamba

Mamba is essentially a fast wrapper for conda and drastically improves the speed of installing and upgrading packages. Install this package first, then use it to install the others:

conda install -c conda-forge mamba

Scipy & Colors

The “scipy” package is needed for working with files in mtx format and “webcolors” is needed for custom colors. Install both with this single command:

mamba install -c conda-forge scipy webcolors

Scanpy

Scanpy has quite a few dependencies. First, we have to install some supporting packages:

mamba install seaborn scikit-learn statsmodels numba pytables

Then we can install scanpy and more supporting packages via the ‘conda-forge’ channel:

mamba install -c conda-forge python-igraph leidenalg louvain scanpy

2b Seurat Environment

Creating the environment

Once the Scanpy environment is finished, set up a second environment for Cell Browser + Seurat:

conda create --name seuratenv python=3.9

It will do some work to find the packages that it needs to install. When prompted, type “y” to finish the setup and install everything.

After creating the env, activate it:

conda activate seuratenv

Mamba

Mamba is essentially a fast wrapper for conda and drastically improves the speed of installing and upgrading packages. Install this package first, then use it to install the others:

conda install -c conda-forge mamba

Scipy & Colors

The “scipy” package is needed for working with files in mtx format and “webcolors” is needed for custom colors. Install both with this single command:

mamba install -c conda-forge scipy webcolors

R, Seurat, SeuratObject, R.oo, and R.utils

Finally, we can install R, Seurat, and a few other supporting packages:

mamba install -c conda-forge r r-seurat r-seuratobject r-r.utils r-r.oo 

Step 3: Make a copy of the Cell Browser Github

We will make a copy of the Cell Browser Github and set the default branch to be ‘develop’. This means that when you’re building cell browsers you’re always using the latest version of the tools. This also allows us to find and fix bugs in the Cell Browser command-line tools before they leak out to the pip release.

In your home directory, run the following command:

git clone https://github.com/ucscGenomeBrowser/cellBrowser.git

Next, move into the new cellBrowser directory and check out the develop branch:

cd cellBrowser
git checkout develop

After that, add the following line to your .bashrc so that you automatically use the right tools:

export PATH=$HOME/cellBrowser/src:$HOME/cellBrowser/ucsc:$PATH

If you already have a ‘PATH’ line in your .bashrc, then just insert ‘$HOME/cellBrowser/src:$HOME/cellBrowser/ucsc’ at the very beginning of your PATH and separate it from the next item with a ‘:’.
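If your .bashrc is sourced more than once per session, a plain export can add the same directories to PATH repeatedly. One way to avoid that is sketched below; the function name path_prepend is hypothetical and is not part of the Cell Browser tools.

```shell
# Hypothetical helper: put a directory at the front of PATH unless it is already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;     # otherwise prepend and keep the existing entries
    esac
}

# ucsc is added first so that src ends up at the very front of PATH
path_prepend "$HOME/cellBrowser/ucsc"
path_prepend "$HOME/cellBrowser/src"
export PATH
```

Wrapping PATH in colons before matching means a directory is only treated as present when it appears as a whole entry, not as part of a longer path.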

Github SSH keys

On hgwdev, generate a new ssh key pair (substitute in your Github email):

ssh-keygen -t ed25519 -C "your_email@example.com"

You’ll be prompted for a file name and a passphrase. Neither is required, so just hit ‘Enter’ at each prompt to skip those steps.

Copy the public key from your terminal window:

cat ~/.ssh/id_ed25519.pub

Follow Github’s instructions from (2) onward to add this public key to your account.

Auto-updating the Github repo

You can set up a cron job to keep this Github repo up-to-date.

Open the crontab editor:

crontab -e

Add these lines to the top of your crontab:

SHELL=/bin/sh

MAILTO={your @ucsc.edu email address}

Then add these lines anywhere in your crontab before saving and exiting:

# cell browser git update
16 6 * * 1-5 cd ~/cellBrowser/; git pull

This will go into your cellBrowser directory and do a ‘git pull’ at 6:16 am every Monday through Friday, regardless of the date or month. IBM has a detailed manual page for cron, and the crontab.guru site can help you visualize how changing those columns affects the cron schedule.
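To make the column order concrete, the sketch below labels the five schedule fields of a crontab entry. The function name describe_cron is hypothetical; it only splits the line on whitespace and does not validate the schedule.

```shell
# Hypothetical helper: label the five schedule fields of a crontab entry.
describe_cron() {
    # A here-document is used so that the '*' fields are not expanded as file globs.
    # The last variable receives everything after the fifth field.
    read -r minute hour day_of_month month day_of_week command <<EOF
$1
EOF
    printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s\n' \
        "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
    printf 'command=%s\n' "$command"
}

describe_cron '16 6 * * 1-5 cd ~/cellBrowser/; git pull'
```

For the entry used on this page, this prints minute=16 hour=6 day_of_month=* month=* day_of_week=1-5 on the first line and command=cd ~/cellBrowser/; git pull on the second.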

Configuration files in your home directory

Setting up a .cellbrowser.conf

This is a file that exists in your home directory and helps set some Cell Browser-wide configuration options. Here are some essential lines to have in this file along with an explanation of why that line is important:

# Tells cbBuild where the data root for the cell browser is
# so that it can properly interpret and build the collection structure
dataRoot = "/hive/data/inside/cells/datasets/"
# Helps us with tracking site usage
gaTag = "UA-132481597-1"
# Shortcuts for use with cbBuild
outDirs = {"alpha" : "/usr/local/apache/htdocs-cells", "beta" : "/usr/local/apache/htdocs-cells-beta"}
# Forces us to remember to fill out these tags 
reqTags=["body_parts"]
# Forces us to remember that directories should be all lowercase 
onlyLower=True

Feel free to copy and paste these lines into the .cellbrowser.conf in your home directory.
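After copying the lines, a quick check can confirm that none of them were dropped. This is a rough sketch under stated assumptions: check_cb_conf is a hypothetical helper, not a Cell Browser command, and it only looks for each setting name followed by ‘=’ at the start of a line without validating the values.

```shell
# Hypothetical helper: report which of the settings listed above are missing from a conf file.
# $1 = path to the conf file; returns non-zero if any setting is absent.
check_cb_conf() {
    conf=$1
    missing=0
    for key in dataRoot gaTag outDirs reqTags onlyLower; do
        # Match the setting name at the start of a line, optional spaces, then "="
        if ! grep -Eq "^${key}[[:space:]]*=" "$conf"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Example (commented out so that sourcing this file does nothing):
# check_cb_conf "$HOME/.cellbrowser.conf"
```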

Your .bashrc

This section covers settings to add to your ~/.bashrc that have been useful for others, though it is not an exhaustive list. You may find it helpful to add your own shortcuts for commands, directories, and more as you wrangle data.

# Useful commands:
alias ls='ls --color=auto'
alias c='clear;pwd'
alias p='pwd -P'  # shows the "real" path in bash, not the path via symlinks

# Useful directory shortcuts
alias cells='cd /hive/data/inside/cells/datasets/'
alias cb='cd ~/cellBrowser/'

# This setting controls what’s shown at the beginning of your command prompt
# This will display your username, the host, and your working directory, e.g. [mspeir@hgwdev asthma-lung]
export PS1='[\u@\h \W]$ ' 
# There are many more customization options, see: https://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html

# Your PATH is where bash will look for commands you run and will go through them in the order specified in your PATH.
# In essence it allows you to run a program like wigToBigWig without having to spell out where that utility lives,
# /cluster/bin/x86_64/wigToBigWig
export PATH=$HOME/cellBrowser/src:$HOME/cellBrowser/ucsc:/cluster/software/bin:/cluster/bin:/bin:/usr/bin:/cluster/bin/$MACHTYPE:/usr/local/bin:/cluster/bin/scripts:$HOME/bin/$MACHTYPE:$HOME/bin/:/cluster/bin/bedtools
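Because the order of PATH entries decides which copy of a command runs, it can help to list every match for a name. The sketch below does that in plain shell; path_matches is a hypothetical helper name. In bash, the builtin ‘type -a’ reports similar information.

```shell
# Hypothetical helper: print every executable match for a command name, in PATH order.
# The first line printed is the copy that would run when you type the bare name.
path_matches() {
    name=$1
    rest=$PATH:                    # trailing ':' so the last entry is handled like the others
    while [ -n "$rest" ]; do
        dir=${rest%%:*}            # text before the first ':'
        rest=${rest#*:}            # everything after the first ':'
        [ -z "$dir" ] && dir=.     # an empty PATH entry means the current directory
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}

# Example (commented out): path_matches cbBuild
```

If the first line is not under $HOME/cellBrowser/src, the directories at the front of your PATH are probably not in the order you intended.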

Your .bash_profile

When you log onto hgwdev, the settings of your .bash_profile will automatically be loaded. These lines will ensure that the settings from your .bashrc are loaded at the same time:

if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

Setting up password-less login to hgwdev

On your own computer run:

ssh-keygen -t ed25519

Copy the public key from your terminal window:

cat ~/.ssh/id_ed25519.pub

Then log into hgwdev with your password and paste that key into ~/.ssh/authorized_keys. (You may need to create this file if it doesn't exist.)
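If you would rather script this step than paste by hand, the following sketch appends a key and tightens permissions, since sshd can ignore an authorized_keys file that other users are able to write to. The helper name add_authorized_key and the file name laptop_key.pub are hypothetical, and the sketch assumes the public key file holds a single line and has already been copied to hgwdev. The standard OpenSSH tool ssh-copy-id covers a similar use case from your own computer.

```shell
# Hypothetical helper: append a public key to authorized_keys, creating the file if needed.
# $1 = file holding the one-line public key
# $2 = the .ssh directory to use (defaults to ~/.ssh)
add_authorized_key() {
    pubkey_file=$1
    ssh_dir=${2:-$HOME/.ssh}
    auth_file=$ssh_dir/authorized_keys

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"       # only you may enter, read, or write the directory
    touch "$auth_file"
    chmod 600 "$auth_file"     # only you may read or write the key list

    # -F fixed string, -x whole line, -q quiet: skip keys that are already listed
    if ! grep -Fxq -- "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# Example on hgwdev (commented out): add_authorized_key "$HOME/laptop_key.pub"
```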

Mosh (optional)

Installing and using Mosh is recommended, but optional. It allows you to leave long-running jobs going in a terminal window and not worry about processes being terminated if your computer goes to sleep or you change networks. Yes, you can do this with the Unix utility ‘screen’, but mosh simplifies the process greatly. With screen, your processes are kept running, but to access them again, you will need to log back into hgwdev and reconnect the screen, whereas with mosh, those windows remain connected, allowing you to get right back to what you were doing. Screen also behaves weirdly with conda envs (see this bug report).

See mosh’s installation instructions. If you’re on a Mac, it’s recommended to use homebrew instead of the ‘dmg’ as it should make future updates easier.

Once you’ve installed mosh, you will need to put these lines somewhere in your ~/.bashrc on both hgwdev and your own computer:

# mosh stuff
export PATH=/cluster/software/bin:$PATH
export LANG="en_US.UTF-8"
export LC_COLLATE=C

Once you've done that, you can log onto hgwdev by just swapping 'ssh' with 'mosh' in your normal login command:

mosh <username>@hgwdev.gi.ucsc.edu