ABRF2010 Tutorial

From genomewiki
Jump to navigationJump to search

Presented by Angie Hinrichs and Robert Kuhn March 20, 2010 as part of the workshop Tools to Facilitate Management, Analysis and Visualization of 2nd Generation Sequencing Data at ABRF 2010.

Intro: Basic Genome Browser navigation

Here we will explore the Genome Browser to get familiar with moving around the landscape and load a simple Custom Track

Navigation -- find a gene

There are several ways to get started. We will start with a clean "cart", select a genome assembly and enter a gene name.

Steps

  1. Go to http://genome.ucsc.edu and hit the button "Genome Browser at the upper left in the blue bar.
  2. Click here to reset (below the pulldown menus). This puts us all at the same place.
  3. Select assembly hg18 (has richer annotations than the newer hg19, which is still being populated with data sets).
  4. In the "gene" box, slowly type "PIGP." You will see the choices of gene names appear in the pulldown as you type.
  5. Got to the gene, PIGP, by hitting the "jump" button.

Navigation -- zooming, panning and controlling data tracks

To get around in the browser view, you can move to any continuously variable region of your choosing.

Steps

  1. Above the browser image, hit the "zoom out 3x" button to see the region around the gene.
  2. Data in the various "tracks" are drawn from separate tables in the browser database.
  3. Note the spikes of conserved sequence among the species in the Conservation track.
  4. Position your cursor at the top of the image in the scale bar above one of the exons. Hold down the mouse button and drag the mouse right or left to highlight the exon. Release the button to expand the exon to full screen.
  5. We'll turn off some of the default tracks to simplify the view. As we are concerned about next-gen sequencing in this workshop, we'll turn off the Array track. Near the bottom of the browser image, find the track, "SNP Genotyping Arrays" and click the little minibutton to the left side of the track.
  6. On the configuration page, we could turn on/off any of the subtracks. Select "hide" from the pulldown and Submit. Each data track has configuration options accessible via the little button.
  7. Below the image, in the track controls, find these tracks and select "hide": RefGene, Human mRNAs, Spliced ESTs, Repeat Masker. Select "Refresh" from any blue bar.
  8. The UCSC Genes Track, SNPs, and Conservation are still displayed. Note the little chevron arrows on UCSC Gene, PIGP, showing that it is transcribed from the negative strand. Click it to see rich information about the gene.
  9. Return to the browser via the link the blue bar at the top.

Custom Tracks -- add your own data

We will load a simple Custom Track to see how user-supplied data can be added to the data already in the browser.

Steps

  1. Below the browser image, select "add custom tracks."
  2. In the upper box, type (or paste):
 track name="test custom track" visibility=pack
 chr21 10000000 10050000
 chr21 10025000 10100000
 This is BED 3 format:  chrom, chromStart, chromEnd
  1. Select "Submit."
  2. At the browser, select "zoom out 10x"


Tutorial: Viewing sequencing data in the UCSC Genome Browser

This page serves as a step-by-step guide to the demo of UCSC Genome Browser tools presented at the workshop. Earlier in the workshop, Charlie Nicolet shared some sequencing and analysis results on chromosome 21 for ChIP-Seq of RNA polymerase II to HeLa cells. Thanks to Charlie for allowing us to post copies of his data and use them as a basis for this tutorial.

In case internet fails at the workshop, slides will be used (File:ABRF2010 UCSC NGS.ppt | File:ABRF2010 UCSC NGS.pdf).

Genome Graphs with .sgr data

Some of the result files from the ChIPSeq toolset that Charlie Nicolet showed us earlier are in the .sgr format, i.e. tab-separated text that specifies chromosome, position and a numeric signal value. This format can be uploaded to UCSC's Genome Graphs tool for genome-at-a-glance view of the data.

Steps

  1. In a new web browser window or tab, go to http://genome.ucsc.edu/ .
  2. On the left side of the page, there is a blue sidebar. Click on "Genome Graphs", the 7th link from the top of the sidebar.
  3. In the assembly menu, change the selection to "Mar. 2006 (NCBI36/hg18)".
  4. Click the upload button.
  5. In the input labeled "name of data set", type or paste "redbin".
  6. Paste this URL: http://genome-test.gi.ucsc.edu/ABRF2010/chr21_extended.txt_redbin.sgr in the box labeled "Paste URLs or data", and click submit.
  7. You should see confirmation screen; click OK to return to the Genome Graphs display.
  8. Now you should see a layout of human chromosomes. Genome Graphs doesn't show the data until you select it. In the graph menu on the left, change "--nothing--" to "redbin 1". Now the data appear -- in this case, for chr21 only.

catch up

If you don't see blue data appearing over chromosome 21, go to http://genome.ucsc.edu/cgi-bin/hgGenome?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_GG to catch up.

Now we will use Genome Graphs to navigate to the Genome Browser.

  1. Click on a peak to jump to the Genome Browser, displaying a 1,000,000-base region of the genome corresponding to the location you clicked in Genome Graphs.
  2. Note the track at the top labeled "redbin 1" on the left and "User Supplied Track 1" above -- that is the data shown in Genome Graphs. Click on the top label "User Supplied Track 1" to make the peaks taller and more visible. In general, peaks correspond to upstream regions of genes, which makes sense for RNA polymerase II ChIP-seq.

catch up

If you don't see the above, go to http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_redbin_1 to catch up.

Transforming .sgr into bedGraph for Genome Browser viewing

The Genome Browser can draw a somewhat nicer display of .sgr data, if we do a simple transform of the .sgr data into a similar format called bedGraph. bedGraph has both start and end genomic coordinates, so scores can be drawn over regions as opposed to single bases. Here is a command that translates .sgr to bedGraph, expanding each data point to cover a 30-base region:

   perl -wpe '@w=split("\t"); $_ = "$w[0]\t" . ($w[1]-1) . "\t" . ($w[1]+29) . "\t$w[2]";' \
     chr21_extended.txt_redbin.sgr > redbin.bed

If you would like to show only datapoints above a certain threshold, say 5, you can make the command slightly more complicated:

   perl -wpe '@w=split("\t"); $_ = "$w[0]\t" . ($w[1]-1) . "\t" . ($w[1]+29) . "\t$w[2]"; \
              if ($w[2] < 15) {$_ = "";}' \
     chr21_extended.txt_redbin.sgr > redbinGt15.bed

This line can be added to the beginning of redbinGt15.bed, in order to add labels to the track display and configure default appearance (see http://genome.ucsc.edu/goldenPath/help/bedgraph.html for more information about settings). A line break is used for readability here, but must be removed if you are going to edit this track line into a file:

   track name=redbinGt15 description="redbin items with score >= 15, expanded to 30 bases" \
     type=bedGraph viewLimits=0:100 autoScale=off maxHeightPixels=128:20:11

The file has been saved here: http://genome-test.gi.ucsc.edu/ABRF2010/redbinGt15.bed .

Steps

  1. In the Genome Browser, click the "manage custom tracks" button below the image.
  2. Click on "add custom tracks"
  3. Paste these two lines into the box labeled "Paste URLs or data" and click submit:
http://genome-test.gi.ucsc.edu/ABRF2010/redbinGt15.bed
browser position chr21:42,886,797-43,886,796
  1. You should see a confirmation page like this (image)
  2. To go to the genome browser, you can click on "Genome Browser" in the top blue bar, or click on "chr21" in the Pos column of the table showing your custom tracks, or on the "go to genome browser" button. Note: in the absence of a "browser position" line, only the top blue bar link will return you to the position you were looking at before.
  3. Now the new custom track shows grayscale representations of the score. You can click on the track title ("redbin items with score > 15") to expand the display to bar graph, which should look a lot like the "redbin 1" track created in Genome Graphs.

catch up

If you don't see two tracks labeled "redbin 1" and "redbinGt15" over the UCSC Genes track, go to http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_redbinGt15 to catch up.

Comparing with UCSC-hosted data

Now we will take a look at some ChIP-seq data from the ENCODE project. The "Yale TFBS" track in the Regulation section combines ChIP-seq data from several labs, and it happens to include data for RNA polymerase II ("Pol2") in HeLa-S3 cells, like Charlie Nicolet's chr21 data. We can visually compare the two datasets to see how well they correspond.

Steps

  1. In the Genome Browser, scroll down below the image to the Regulation section. Click the "+" button on the left side of the blue "Regulation" header to expand.
  2. Click on the link titled "Yale TFBS". This will take you to the track controls.
  3. In the large table under "Select subtracks by cell line and factor",
  4. Hit the "-" button in upper left corner to deselect all subtracks. Note that HeLa-S3 cell line is in 3rd column of checkboxes.
  5. Scroll down to factor Pol2. select the 3rd check box. Only HeLa/Pol2 is selected for display.
  6. Scroll back up to the top of the page. Change visibility to "full" and click submit.
  7. In the Genome Browser, HeLa/Pol2 peaks and signal appear below "Spliced ESTs".

The intervening tracks make it a little difficult to compare the ENCODE data with Charlie's, so we will adjust the display to see only the tracks we want to compare.

  1. Click the "hide all" button under the image.
  2. In the Custom Tracks section below the image, change the visibilities of redbin 1 and redbinGt15 to "full".
  3. In the Genes and Gene Prediciton Tracks section, change the visibility of UCSC Genes to "dense".
  4. In the Regulation section, change the visibility of Yale TFBS to "full".
  5. Click the "refresh" button on any blue section header.

catch up

If you don't see a clear correspondence between the track on the bottom and the tracks on the top, go to http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_TFBS to catch up.

Viewing alignments in BAM format

Charlie Nicolet has shared the ELAND export file that accompanies chr21_extended.txt, chr21_export.txt. This can be transformed into BAM for viewing in the UCSC Genome Browser. The samtools package includes a script that converts ELAND export files into SAM (the text version of BAM). Export files often have fasta names instead of sequence names, e.g. "chr21.fa" instead of "chr21", so the ".fa" must be removed:

   export2sam.pl --read1=chr21_export.txt \
   | perl -wpe 's/(chr.*)\.fa/$1/' \
     > chr21.sam

Edit chr21.sam to add these header lines at the beginning (note: the words must be separated by tab characters, not spaces):

@HD	VN:1.0	GO:none	SO:coordinate
@SQ	SN:chr21	LN:46944323

Note that current versions of samtools do not require the @SQ fields, you can also add the -t option to the "samtools view" command below and specify a chromSizes.txt file in that command (from the chromInfo table, e.g. using the fetchChromSizes script).

Use samtools to convert SAM to BAM, sort the BAM, and index the sorted BAM (see http://genome.ucsc.edu/goldenPath/help/bam.html for more info):

   samtools view -bSo chr21.unsorted.bam chr21.sam
   samtools sort chr21.unsorted.bam chr21
   samtools index chr21.bam

Now the sorted chr21.bam and its index file chr21.bam.bai must be put on a web server (ftp, http, or https). For example, if you create files my.bam and my.bam.bai, and you have a personal web directory http://my.edu/mylab/~me/, you can copy or link the files to http://my.edu/mylab/~me/my.bam and http://mylab/~me/my.bam.bai. Then you would use only the first URL (http://my.edu/mylab/~me/my.bam) in a track definition line like the one below. For this example, we will use http://genome-test.gi.ucsc.edu/ABRF2010/chr21.bam .

Steps

  1. In the Genome Browser, click the "manage custom tracks" buttom below the image.
  2. Click the "add custom tracks" button.
  3. Paste this text into the box:
track name=chr21_export type=bam bigDataUrl=http://genome-test.gi.ucsc.edu/ABRF2010/chr21.bam visibility=pack
  1. Click submit. Click "Genome Browser" in the blue top bar. Now you should see an almost-solid dark blue bar with a few dark red stripes, labeled "chr21_export" above and to the left.

catch up

If you don't see the above, got to http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_chr21_export_1MB to catch up. There are so many alignments in this 1MB window that the items can't be drawn individually -- the image would become too tall. Let's view a smaller genomic region to get a better look at the alignments.

  1. Click and drag on the base position track to select a window around the couple largest peaks. (If it doesn't work for you, paste chr21:43,169,278-43,191,176 into the position box and click "jump".)
  2. Now we are viewing about 22,000 bases covering two peaks, and there are still so many alignments that the track is forced into "squish" display mode: items are shown at half-height and read names are suppressed. Let's zoom in again, to the right side of the left peak (or jump to position chr21:43,172,778-43,173,354).
  3. Even when viewing 577 bases, there is still quite a pileup of read alignments to view. The read names take up a lot of room, so let's hide them. Click on the light blue bar to the left of the chr21_export track, to go to track controls.
  4. There is a checkbox to the right of "Display read names" -- click to clear the checkbox, and then click submit.
  5. Mismatches appear as light red tickmarks. They are a bit easier to see if the items are drawn in grayscale, so again click on the light blue bar on the left to return to track controls.
  6. Under "additional coloring modes", select "Use gray for [alignment quality]", and click submit.
  7. Note the trend for mismatches to appear more often in alignments with lower quality, and also the relative frequency of lower-quality reads at the edge of the peak.

catch up

If you don't see the above, go to http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_chr21_export_grayQual to catch up.

  1. Zoom in once again to focus on the region with the most mismatching items (chr21:43,172,834-43,172,931). Now we see the exact mismatches between reference and read bases.
  2. Let's take a look at base qualities. Once again, click the light blue bar on the left to return to track controls.
  3. Change the selection to "Use gray for [base qualities]", and click submit. Some of the mismatches, but not all, almost disappear due to the light shades used for low quality bases.

catch up

If you don't see the above, go to http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_chr21_export_baseQual to catch up.

Extra: Finding variants in alignment display

In some regions, stacks of red tickmarks (mismatches) indicate HeLa variants with respect to the reference genome assembly. Here are several examples.

  1. Paste chr21:37,366,724-37,366,992 into the position box and click "jump".
  2. To compare with variants in dbSNP, scroll down to the Variation section and change the visibility of SNPs (130) to pack. Click "refresh" on any blue section header.
  3. The SNP track appears at the bottom of the image. SNPs classified "coding non-synonymous" by dbSNP are colored red. (If you don't see that, click http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=AngieHinrichs&hgS_otherUserSessionName=ABRF2010_chr21_export_SNPs to catch up.)
  4. To see a variant not in dbSNP, paste chr21:37,367,831-37,367,845.
  5. To see a heterozygous variant adjacent to a dbSNP 3-base indel, paste chr21:37,367,478-37,367,492.
  6. This 110-base region contains three know variations from the reference assembly: chr21:37,366,860-37,366,969.

Where to learn more