Making a hub for a cell browser
Revision as of 18:55, 7 July 2022
This page is divided into two sections:
- A discussion of makeCbHub and its options
- How to make a track hub for a cell browser utilizing makeCbHub
makeCbHub
makeCbHub generates the trackDb stanzas for a set of tracks, which you can then use to create a track hub. This section goes through the various options for the script and how to use them.
Required option: fileDir
There is one required option: fileDir. This is a directory of big* files or a directory of directories containing big* files.
For example, use the commands below to generate the trackDb stanzas for a single composite track in the mouse-brain-cutandtag dataset:
    cd /hive/data/inside/cells/datasets/mouse-brain-cutandtag/hub
    makeCbHub olig2

Output:

    track olig2
    compositeTrack on
    shortLabel olig2
    longLabel olig2
    visibility dense
    autoScale group
    type bigWig

    track olig2_cluster_non_oligo
    parent olig2 on
    shortLabel cluster_non_oligo
    longLabel cluster_non_oligo
    type bigWig 0.000000 2358.294189
    autoScale group
    bigDataUrl olig2/cluster_non_oligo.bw
    visibility dense

    track olig2_cluster_oligo
    parent olig2 on
    shortLabel cluster_oligo
    longLabel cluster_oligo
    type bigWig 0.000000 280.645325
    autoScale group
    bigDataUrl olig2/cluster_oligo.bw
    visibility dense
As you can see, it works, though the result wouldn't be particularly pretty to look at in the Genome Browser. The labels are not very human-friendly, and both tracks will be drawn in the same default color, black.
Options to customize output
The six optional arguments for makeCbHub allow you greater control over what’s put into these trackDb stanzas, including shortLabels, longLabels, and colors.
Composite track labels: -d/--datasetList
The option -d/--datasetList serves two purposes:
- If fileDir contains multiple directories, you can specify which of those you want to build trackDb stanzas for.
- By default, the directory names under fileDir are used as the labels for the composite/parent tracks in the trackDb. However, these names are required to be all lowercase (e.g. h3k27ac, bw, or clusters). This option allows you to specify the casing used for the short/long labels.
    makeCbHub -d "Rad21 Olig2" bw/

    track rad21
    compositeTrack on
    shortLabel Rad21
    longLabel Rad21
    visibility dense
    autoScale group
    type bigWig
    ...

    track olig2
    compositeTrack on
    shortLabel Olig2
    longLabel Olig2
    visibility dense
    autoScale group
    type bigWig
    ...
This command assumes that in bw/ (fileDir), there are two directories: rad21 and olig2, but it will use Rad21 and Olig2 as the shortLabel/longLabel for those composites in the trackDb.
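One way to picture the behavior described above is as a case-insensitive match between the names given to -d and the lowercase directory names. The following is a minimal sketch of that idea only, not makeCbHub's actual code; the function name composite_labels is hypothetical.

```python
def composite_labels(dataset_list, dir_names):
    """Map each lowercase directory name to the label given via -d.

    dataset_list: the space-separated string passed to -d, e.g. "Rad21 Olig2"
    dir_names: directory names found under fileDir
    Directories that are not named in dataset_list are left out.
    """
    wanted = {label.lower(): label for label in dataset_list.split()}
    return {d: wanted[d] for d in dir_names if d in wanted}


labels = composite_labels("Rad21 Olig2", ["h3k27ac", "olig2", "rad21"])
print(labels)  # {'olig2': 'Olig2', 'rad21': 'Rad21'}
```

In this sketch h3k27ac is skipped because it was not listed, which mirrors the first purpose of the option (choosing which directories to build stanzas for).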
Individual track labels: -s/--shortLabel and -l/--longLabel
The options -s/--shortLabel and -l/--longLabel allow you to control the short and long labels of the individual tracks in the composites. By default, the script uses the file names as the labels, which, depending on how the files are named, can be pretty messy:
    cd /hive/data/inside/cells/datasets/olg-eae-ms/eae-multiomics/hub
    makeCbHub bw/
Which results in:
    ...
    track bw_P21208_1004_OPC_Ctr_RND1_peaks
    parent bw on
    shortLabel P21208_1004_OPC_Ctr_RND1_peaks
    longLabel P21208_1004_OPC_Ctr_RND1_peaks
    type bigWig 0.000000 160.000000
    autoScale group
    bigDataUrl bw/P21208_1004_OPC_Ctr_RND1_peaks.bw
    visibility dense
    ...
However, we can rebuild the trackDb with these options:
    cd /hive/data/inside/cells/datasets/olg-eae-ms/eae-multiomics/hub
    makeCbHub -s shortLabels.tsv -l acronyms.sorted.tsv bw/
Which outputs:
    ...
    track bw_P21208_1004_OPC_Ctr_RND1_peaks
    parent bw on
    shortLabel OPC_Ctr
    longLabel OPC_Ctr - Control oligodendrocyte precursor cells
    type bigWig 0.000000 160.000000
    autoScale group
    bigDataUrl bw/P21208_1004_OPC_Ctr_RND1_peaks.bw
    visibility dense
    ...
The shortLabels file contains two columns.
- file name
- short label
Here's the line from the shortLabels.tsv used in the example above:

    P21208_1004_OPC_Ctr_RND1_peaks.bw	OPC_Ctr

The longLabels file follows the same format as the 'acronymFile' setting that can be specified in the cellbrowser.conf (and will most likely be the same file). The two columns are:
- short label
- long label
Here's the line from the acronyms.sorted.tsv used in the example above:

    OPC_Ctr	Control oligodendrocyte precursor cells
Note: the shortLabel/longLabel files can be csv or tsv format and their file names need to end with csv or tsv (e.g. shortLabels.tsv).
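To make the relationship between the two files concrete, here is a minimal sketch of how a file name could be resolved to its short and long labels. It is an illustration under assumptions, not the script's real implementation: the helper names read_two_columns and labels_for are hypothetical, and the "short - long" join is inferred from the example output above.

```python
import csv
import io


def read_two_columns(text, delimiter="\t"):
    """Parse two-column text into a dict: first column -> second column."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return {row[0]: row[1] for row in rows if len(row) >= 2}


# One line from each file, as shown above (tab-separated).
short_labels = read_two_columns("P21208_1004_OPC_Ctr_RND1_peaks.bw\tOPC_Ctr\n")
acronyms = read_two_columns("OPC_Ctr\tControl oligodendrocyte precursor cells\n")


def labels_for(file_name):
    """Return (shortLabel, longLabel), falling back to the bare file name."""
    default = file_name.rsplit(".", 1)[0]
    short = short_labels.get(file_name, default)
    long_label = f"{short} - {acronyms[short]}" if short in acronyms else short
    return short, long_label


print(labels_for("P21208_1004_OPC_Ctr_RND1_peaks.bw"))
# ('OPC_Ctr', 'OPC_Ctr - Control oligodendrocyte precursor cells')
```

Note that the first file is keyed by file name while the second is keyed by short label, so a typo in a short label silently breaks the long-label lookup.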
Colors: -c/--colors
If a track doesn't include the 'color' setting, the Genome Browser colors that track black by default. The -c/--colors option allows you to specify the color for each track. The file follows the same format as the 'colors' setting that can be specified in the cellbrowser.conf (and will most likely be the same file). The two columns are:
- short label
- hexcode color (e.g. #834088), although RGB tuples (e.g. '131 64 136' or '131, 64, 136') are also accepted
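The trackDb 'color' setting takes a comma-separated RGB triple, so each accepted input form has to end up in that shape. The sketch below shows one way such a conversion could work; to_trackdb_color is a hypothetical helper written for illustration, not a function from makeCbHub.

```python
import re


def to_trackdb_color(value):
    """Convert '#834088', '131 64 136' or '131, 64, 136' to '131,64,136'."""
    value = value.strip()
    if value.startswith("#"):
        hex_digits = value[1:]
        if len(hex_digits) != 6:
            raise ValueError(f"expected 6 hex digits, got {value!r}")
        parts = [int(hex_digits[i:i + 2], 16) for i in (0, 2, 4)]
    else:
        # Split on commas and/or whitespace.
        parts = [int(p) for p in re.split(r"[,\s]+", value) if p]
    if len(parts) != 3 or not all(0 <= p <= 255 for p in parts):
        raise ValueError(f"not an RGB color: {value!r}")
    return ",".join(str(p) for p in parts)


print(to_trackdb_color("#834088"))       # 131,64,136
print(to_trackdb_color("131, 64, 136"))  # 131,64,136
```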
Track description page: -f/--html
A track description page is there to briefly describe the data displayed by that track, any special display conventions, and contact info for the data creators.
Creating and adding a track description page is discussed in more detail in the "Track description page" section below.
bigBed tracks: -b/--bbType
Most hubs you'll be building are going to be based on bigWig files. Occasionally, you may need to build one that includes a composite of bigBed files. The -b/--bbType option allows you to specify the bigBed type, e.g. 'bigBed 12 +', 'bigInteract', 'bigNarrowPeak'. This option is needed because bigBed files are quite diverse. Many of the big* formats other than bigWig essentially use the bigBed format as a base but use a different 'type' in the trackDb to tell the Genome Browser how to interpret and display the different columns. It would be difficult to auto-detect the type from just the info in the file itself.
Here's an example using this option:
    cd /hive/data/inside/cells/datasets/human-enhancer-atlas/hub
    makeCbHub --bbType bigNarrowPeak bb
Which outputs:
    track bb
    compositeTrack on
    shortLabel bb
    longLabel bb
    visibility dense
    autoScale group
    type bigBed

    track bb_Acinar
    parent bb on
    shortLabel Acinar
    longLabel Acinar
    type bigNarrowPeak
    bigDataUrl bb/Acinar.bigBed
    visibility dense
Using all these options together
Altogether, these options result in a solid track hub with user-friendly labels, colors, and a track description page. We'll skip -b/--bbType since it's so rarely used. Here's an example you can run to see the results of combining these various options yourself:
    cd /hive/data/inside/cells/datasets/olg-eae-ms/eae-multiomics/hub
    makeCbHub -s shortLabels.tsv -l acronyms.sorted.tsv -c ../colors.csv -f track_desc.html bw/
Which gets us:
    track bw
    compositeTrack on
    shortLabel bw
    longLabel bw
    visibility dense
    autoScale group
    type bigWig
    html track_desc.html
    ...
    track bw_P21208_1004_OPC_Ctr_RND1_peaks
    parent bw on
    shortLabel OPC_Ctr
    longLabel OPC_Ctr - Control oligodendrocyte precursor cells
    type bigWig 0.000000 160.000000
    autoScale group
    color 131,64,136
    bigDataUrl bw/P21208_1004_OPC_Ctr_RND1_peaks.bw
    visibility dense
    ...
In this case, you may want to adjust the short/long labels for the parent track from bw to something like "Chromatin Accessibility", but small improvements like that are relatively easy. You can take a look at the hub in a Genome Browser session or take a look at the hub.txt used for that hub: https://cells.ucsc.edu/olg-eae-ms/eae-multiomics/hub/hub.txt.
Building a hub
File organization
The script assumes a certain directory structure when it creates the trackDb stanzas.
In your dataset directory, create a ‘hub’ directory where all of the hub-related files will live. In that hub directory, you will then create a directory for each composite/parent track (directory names should be all lowercase):
    cb_dataset_dir/
    |--> hub/
         |--> track_set_A/
              |--> track_A1.bw
              |--> track_A2.bw
              |--> etc…
         |--> track_set_B/
              |--> track_B1.bw
              |--> etc…
How the individual tracks are divided into composite/parent tracks will vary from dataset to dataset. For example, in the collection mouse-brain-cutandtag, individual tracks were divided into a composite track for each dataset in the collection (e.g. h3k27ac, h3k27me3, h3k27me3-cell-lines, h3k36me3, h3k4me3, olig2, rad21), as this was what the authors requested. In neuro-degen-atac, individual tracks were grouped according to their corresponding metadata field (e.g. broad-celltypes, clusters, neuronal-celltypes, neuronal-clusters). If you’re not sure how to group the tracks, ask Max and/or the contributors.
Finally, it’s best to make symlinks to the track files in the orig directory to prevent the unnecessary duplication of large amounts of files. See human-enhancer-atlas/hub and fetal-chromatin-atlas/hub as examples, where the bigWigs alone were 212 GB and 138 GB, respectively. (/hive has a ton of storage, but it's good to not waste space unnecessarily.)
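The symlinking step could be scripted along the following lines. This is a minimal sketch, assuming the track files sit directly in the orig directory and end in .bw; the function name link_tracks and the paths in the usage comment are hypothetical.

```python
from pathlib import Path


def link_tracks(orig_dir, track_dir, pattern="*.bw"):
    """Symlink every file matching pattern from orig_dir into track_dir."""
    orig_dir, track_dir = Path(orig_dir), Path(track_dir)
    track_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in sorted(orig_dir.glob(pattern)):
        dest = track_dir / src.name
        # Leave any existing file or link untouched.
        if not dest.exists() and not dest.is_symlink():
            dest.symlink_to(src.resolve())
        linked.append(dest)
    return linked


# Example call (hypothetical paths, run from the dataset directory):
# link_tracks("orig", "hub/bw")
```

The links point at absolute paths here, so they keep working if the hub directory is moved but break if the orig directory is.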
Running the script
Getting the hub ready
Using makeCbHub is just one part of getting your hub ready. This section steps through the process of making a hub and getting it to show up for a cell browser. It assumes that your hub only covers one genome assembly (e.g. mm10 or hg38), not multiple. If a track hub only covers one assembly, you can take advantage of the setting 'useOneFile on' to put everything into a single hub.txt file. Hubs on multiple assemblies are a little more complex and require a set of hub.txt/genomes.txt/trackDb.txt files.
Common hub.txt settings
These settings will show up in any hub you create for the Cell Browser, though you will need to adjust their values to be specific to your hub:
    hub eae_multiomics
    shortLabel scATAC-seq of EAE Mice
    longLabel scATAC-seq of EAE Mice
    genomesFile genomes.txt
    email eneritz.agirre@ki.se
    descriptionUrl https://www.sciencedirect.com/science/article/pii/S0896627321010898
    useOneFile on

    genome mm10
Breakdown of what to do for each setting in the first stanza:
- hub - can be the same as the dataset short name (e.g. neuro-degen-atac)
- shortLabel and longLabel - can be the same, though feel free to add more details to the longLabel if needed
- genomesFile - will always be genomes.txt
- email - should be the email of the contributor
- descriptionUrl - can point to the paper or to the track description page you created
- useOneFile - will always be on
After that, there should be a line break and then the 'genome' setting. You will need to confirm with the contributor about what genome assembly they aligned their data to, though it will most likely be one of the following: hg19, hg38, mm10, mm39. If it's a public dataset, you may be able to find this information in the methods section.
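Putting the pieces together, a single-file hub.txt is the hub stanza, a blank line, the 'genome' line, and then the trackDb stanzas. The sketch below assembles that text from its parts; build_hub_txt is a hypothetical helper, and placing the makeCbHub output after the genome line separated by a blank line is an assumption based on the usual 'useOneFile on' layout.

```python
def build_hub_txt(settings, genome, trackdb_text):
    """Assemble a single-file hub.txt: hub stanza, genome line, trackDb stanzas."""
    hub_stanza = "\n".join(f"{key} {value}" for key, value in settings.items())
    return f"{hub_stanza}\n\ngenome {genome}\n\n{trackdb_text.strip()}\n"


settings = {
    "hub": "eae_multiomics",
    "shortLabel": "scATAC-seq of EAE Mice",
    "longLabel": "scATAC-seq of EAE Mice",
    "genomesFile": "genomes.txt",
    "email": "eneritz.agirre@ki.se",
    "descriptionUrl": "https://www.sciencedirect.com/science/article/pii/S0896627321010898",
    "useOneFile": "on",
}
# trackdb_text stands in for the stanzas printed by makeCbHub.
trackdb_text = "track bw\ncompositeTrack on\nshortLabel bw\n"
print(build_hub_txt(settings, "mm10", trackdb_text))
```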
Track description page
Additionally, you'll need to create a track description page for the hub. A track description page describes the data that people are looking at, any display conventions to keep in mind for that data (e.g. coloring), a contact email, and a reference to the paper(s). If you keep the page fairly high-level, you can use the same one for all of the tracks in the hub. Even if there are multiple composite/parent tracks in your hub, you can use the same description page for all of them since often the data are very similar.
You can use the Genome Browser template as a starting point. Additionally, it can be useful to look at the track description pages for other hubs in the Cell Browser:
- olg-eae-ms/eae-multiomics: https://cells.ucsc.edu/olg-eae-ms/eae-multiomics/hub/track_desc.html
- olg-eae-ms/eae-atac: https://cells.ucsc.edu/olg-eae-ms/eae-atac/hub/track_desc.html
- neuro-degen-atac dataset: https://cells.ucsc.edu/neuro-degen-atac/track_desc.html (A good example of covering multiple composites with one page)
Once you have the page written, place it in the hub/ directory and use makeCbHub with the -f/--html option to add the proper settings to the trackDb stanzas:
    cd /hive/data/inside/cells/datasets/olg-eae-ms/eae-multiomics/hub
    makeCbHub -s shortLabels.tsv -l acronyms.sorted.tsv -c ../colors.csv -f track_desc.html bw
Which outputs:
    track bw
    compositeTrack on
    shortLabel bw
    longLabel bw
    visibility dense
    autoScale group
    type bigWig
    html track_desc.html

    track bw_P21208_1004_OPC_Ctr_RND1_peaks
    parent bw on
    ...