This page guides the user through a systems-level analysis of the response in gene expression in Halobacterium salinarum NRC-1 to changes in oxygen levels. It demonstrates the capabilities of the Firegoose, a data integration extension to the Firefox browser and part of the Gaggle framework.

Warning: This tutorial contains links to the 2005-11 version of the Gaggle tools and is being kept here for backward compatibility. Most users should use a more recent version.

Part 2 demonstrates the Firegoose and can be done independently of Part 1.

Requirements for Part 1

Note: Installing R and setting it up to communicate with the Gaggle takes some effort. For details, see the installation instructions for the R goose. R is not required for Part 2.

Requirements for Part 2

The Scenario

A researcher undertakes a study of the physiological response in H. salinarum to changes in oxygen level. She does a series of 61 microarrays under conditions of varying oxygen concentration.

What can we learn from this data?

Part 1: (Optional) Finding and clustering differentially expressed genes

The first step of this analysis is to find genes whose expression changes under conditions where the level of oxygen has been perturbed relative to a reference condition. Then we use a clustering algorithm to group together genes whose expression changes in similar ways.

This part of the analysis exercises the Gaggle and a few desktop tools, namely R, MeV, and the DMV. The list of genes computed in Part 1 has been encoded in the Gaggle microformat and embedded in this page. The reader can skip directly to Part 2 to analyze these genes using the Firegoose.

Start Gaggle tools

  1. Start the Gaggle Boss.
  2. Start the DMV.
  3. Start R. Connect to the Gaggle by typing the following at the R command prompt:
    library(gaggle)
    source("http://gaggle.systemsbiology.net/R/gaggleUtil.R")
    gaggleInit()
    
    (or just cut-n-paste!)

Load the microarray data into the DMV

  1. In the DMV, open the environmental folder on the left and select oxygen. A button labeled with a red 61 should appear in the top of the window. Click the button to load microarray data for the 61 oxygen conditions.
  2. The DMV should open two tabs: lambdas and log10 ratios.

For each condition we have two measurements per gene:

Broadcast ratios to R

  1. In the DMV, Make sure the log10 ratios tab is selected.
  2. Click the All button to select all rows in the log10 matrix.
  3. Click Update and select R in the drop-down list of geese.
  4. Broadcast the matrix to R by clicking the button marked M.

Normalize microarray data

  1. In R, assign the name ratios to the matrix we just broadcast.
  2. Note that on some platforms (Windows) the matrix ready message fails to appear. Type "dim(ratios)" to verify the size of the matrix.
  3. In one step, we will normalize ratios to a mean of 0 and a standard deviation of 1 and broadcast the normalized matrix back to the DMV. This is done to make comparing expression profiles during clustering easier.

matrix ready, dimension 2400 x 61
> ratios <- getMatrix()

> dim(ratios)
[1] 2400   61

> broadcast(normalize(ratios), "log10_ratios_normalized")
  1. A new tab labeled "log10_ratios_normalized" should appear in the DMV.

Broadcast lambdas to R

  1. Select the lambdas tab in the DMV.
  2. Press All to select all rows.
  3. R should still be selected in the drop-down list of geese.
  4. Press M to broadcast the matrix to R.

Find differentially expressed genes

Out of the ~2400 unique genes in H. salinarum, we want to find those significantly differentially expressed under our experimental conditions. To accomplish this, a threshold can be applied in R to the lambda values.

  1. Assign the matrix just broadcast to a variable as before. This time call it lambdas.
  2. Again, verify the dimensions of the matrix.
  3. The filterMatrix function returns a list of row names where the data in the row passes a threshold. Here we require that 6 conditions have a lambda value of at least 50. The FALSE parameter indicates that we don't care whether the 6 conditions are consecutive. The choice of these parameters is somewhat arbitrary and some judgement must be applied.
  4. broadcast these genes back to the DMV.

matrix ready, dimension 2400 x 61
> lambdas <- getMatrix()

> dim(lambdas)
[1] 2400   61

> sig_genes <- filterMatrix(lambdas, 50, 6, FALSE)

> sig_genes

  [1] "VNG0013C" "VNG0014C" "VNG0017H" "VNG0018H" "VNG0022H" "VNG0027H"
  [7] "VNG0028C" "VNG0029H" "VNG0033H" "VNG0043H" "VNG0049H" "VNG0053H"
...
[445] "VNG6432H" "VNG6439H" "VNG6441H"

> broadcast(sig_genes)
  1. In the log10_ratios_normalized tab of the DMV, the 447 rows that passed the filter should be selected.

Cluster submatrix using MeV

We use clustering to group together genes with similar expression profiles.

  1. Start MeV (multiexperiment viewer).
  2. In DMV, make sure the log10_ratios_normalized tab is selected and the 447 rows from the previous step are selected. Click M to broadcast the submatrix to MeV.
  3. A green and red heat-map should appear in MeV.
  4. Press the PCA button and click "OK" to cluster the expression profiles using the priciple component algorithm. PCA is a technique that projects data of high dimesionality onto fewer axes.
  5. Open the PCA genes folder (left of screen) and open the subfolders Projections on the PC axes, Components 1,2,3, and 2D Views. The axis that corresponds best to response to oxygen is axis 1. Click on "1,2" to plot the genes against these axes.
  6. Use the mouse to draw an ellipse around the genes to the left of the Y-axis. Select as many as possible without crossing the Y-axis.

Clustering by principle component analysis

Highlight the selected cluster in the DMV

  1. Return momentarily to the DMV. Press "Clear" to clear the current selections.
  2. Broadcasting the clustered genes is slightly awkward. Back in MeV, right-click on "1,2" (on the left) and select Launch new session. A heat-map will appear containing just the genes in the selected cluster.
  3. Open the Gaggle menu and click "Broadcast" to broadcast the gene names.
  4. In the DMV, 222 genes should be selected (give or take a few depending on how generous an ellipse you drew).

A similar procedure can be used to select the genes activated by oxygen, which lie on the right side of the Y-axis.

Resulting gene clusters

The results of this part of the analysis are a pair of gene clusters broadly classified into those whose expression is correlated with oxygen levels and those whose expression is anticorrelated with oxygen. Cluster 1 holds the genes induced by the absence of oxygen. Cluster 2 holds genes induced by the presence oxygen.

Cluster 1: Anaerobic genes

These 222 genes were found to be genes activated under anaerobic conditions.

[+] Show/hide cluster 1.

Cluster 2: Aerobic genes

These 223 genes were found to be genes repressed under anaerobic conditions and active under aerobic conditions.

[+] Show/hide cluster 2.

Part 2: Exploring the functions of anaerobically induced genes.

In this part of the analysis we demonstrate the utility of the Firegoose by discovering the cellular processes in which the differentially regulated genes take part. We will concentrate on the anaerobically induced genes, but a similar analysis could be done on the aerobically induced genes as well.

If you haven't already, install the Firegoose add-on to the Mozilla browser (instructions here). The Firegoose adds a toolbar to the browser that enables quick and easy data transfer between the Gaggle and a number of popular biological web sites. The parts of the toolbar and their functions are shown below.

Firegoose toolbar

  1. If the Gaggle Boss is not already running, start the Gaggle Boss.
  2. Press the multicolored Gaggle button on the Firegoose toolbar to connect the Firegoose to the Boss. The status indicator in the lower right of the browser should say, "Firegoose: Connected" and the Firegoose should appear in the list of connected geese in the Boss.

Known problems

The Firegoose has some bugs the cause strange behavior when there are multiple browser windows open. Also, there are points at which time consuming operations are happening in the background without a clear indication for the user to wait. Please be patient with these problems.

Step 1: Find KEGG Pathways

Query the KEGG Pathway database for standard biochemical pathways in which our genes of interest participate.

  1. In the Firegoose toolbar, click on the Gaggle Data drop-down menu and select cluster 1 anaerobic genes: NameList(222). The toolbar has detected the presence of data embedded in this web page in the Gaggle microformat.
  2. In the target menu, select KEGG Pathway. The list of 222 anaerobically induced genes is now ready to be broadcast to KEGG.
  3. Click the Broadcast button.

A new tab should open in the browser window showing KEGG pathways found for our query. Along with many genes of unknown function, we see a few interesting results:

H. salinarum ferments arginine in the absence of oxygen, a process in which some of the amino acid metabolism genes are likely involved. Other pathways include transporters by which the organism may alter its uptake of nutrients. Finally, two transcription factors are identified, providing clues to the regulatory systems involved.

Step 2: Search EMBL STRING

EMBL STRING is a database of functional associations between proteins. We will next query STRING with our set of genes of interest.

Note: there will be a long pause while STRING is processing the query. Wait while the message "Waiting for string.embl.de..." appears in the lower left corner of the browser.

  1. Make sure cluster 1 anaerobic genes: NameList(222) is still selected in the Gaggle Data menu.
  2. Select the target EMBL STRING and click Broadcast.
  3. The initial STRING query page will appear. Wait while STRING processes the query, which will take about 10-15 seconds. A list of proteins will appear. Click the Continue -> button.

String network

A large network will appear showing the proteins as nodes and colored edges indicating various kinds of evidence for functional associations. Some connected components in the network correspond to the pathways found in KEGG. Others represent specialized pathways not present in the KEGG database. One of these is Dimethylsulfoxide respiration, which appears in the network as 6 interconnected proteins.

DMSO respiration proteins

Clicking on the nodes in a STRING network gives more information including protein domains and annotations. Clicking on VNG0829G reveals its function as a DMSO reductase. Another group of 5 interconnected proteins are gas vesicle proteins (gvp) which H. salinarum uses to control bouyancy.

Step 3: Examine the VNG1187G cluster

We'll drill down into STRING's supporting evidence for the group of genes of unknown function containing VNG1187G.

String 1187

  1. Click on VNG1187G in the STRING network. In the pop-up menu click on recenter the network on this node. VNG1187G should now appear in a network by itself.
  2. Click more twice to expand the network out to second neighbors.

This brings in proteins that were not identified as differentially expressed, but may still participate in pathways with the proteins linked to VNG1187G. The additional nodes may shed some light on the function of these proteins.

  1. Click on the nodes in the new network and examine their annotations and domains.
  2. Click the browser's Back button 3 times to return to the full network.

Step 4: Broadcast to Cytoscape

The size of the network makes it unweildy and STRING provides no way to select subnetworks. So, we want to broadcast the full STRING network to Cytoscape. Again, be prepared for a longish pause.

  1. Start Cytoscape (with cygoose plug-in).
  2. Make sure Cytoscape connects to the Boss.
  3. Click the multicolored Gaggle button on the Firegoose toolbar to update list of connected Geese.
  4. With the STRING tab open in the browser, select Protein interactions from STRING: Network in the Gaggle Data menu.
  5. Select Cytoscape in Firegoose's target menu.
  6. Click Broadcast. Wait ~10-15 seconds while the toolbar downloads the network as an XML file, parses it, and transfers it to Cytoscape.
  7. Switch to Cytoscape. Open the Layout menu and select yFiles->Organic. Zoom in until the node labels become visible.

Cytoscape

Now that we have our network in a more interactive form, we can select components and broadcast them. This enables us to find more information about selected subsets of the network.

Detailed instructions for using Cytoscape are available at cytoscape.org.

Step 5: Broadcast selected nodes to annotation database

The Halo Annotations target refers to an organism specific database of functional annotations based on sequence and structure based computation and experimental evidence. We can query this database by selecting some nodes in Cytoscape and broadcasting them (as a List) to the Firegoose, then rebroadcasting them to to the Halo Annotations target.

  1. Select the group of four connected genes containing VNG1187G in Cytoscape.
  2. In the CyGoose panel at the left of the Cytoscape window, select Firegoose as the target.
  3. Click the button marked List to broadcast the four gene identifiers to Firegoose.
  4. In the Firegoose, make sure gaggle: NameList(4) is selected in the Gaggle Data menu and select Halo Annotations in the target menu. Click Broadcast.

Halo Annotations

We see that some of these genes are annotated as being involved in nitrite reduction. We can drill down into the supporting evidence by clicking the links in the Function column.

  1. Click on "putative Cu-containing nitrite reductase", which is the annotation for VNG1187G.
  2. Then click on the match labeled 1kbvA to show the entry in the Protein Data Bank on which this annotation was based.

Try the same procedure on other groups of genes. Select them in Cytoscape, broadcast to Firegoose, then use the Firegoose to query the annotations database or other data sources, for example, Entrez Protein.

Step 6: Perform functional clustering with DAVID

DAVID is another resource for functional annotations and protein domains. One challenge in using DAVID is that it doesn't work well with the VNG naming system used in H. salinarum. We'll have to translate our broadcasts to and from DAVID from VNG identifiers to GI numbers.

(optional)

  1. Start the VNG->GI translator.
  2. Select Firegoose as the target in the translator.
  3. Check the "auto" checkbox. With auto checked, the translator will automatically translate and rebroadcast any broadcast it receives.
  4. Update the list of geese in Firegoose by clicking the multicolored Gaggle button.
  5. From the Firegoose, broadcast the cluster 1 anaerobic genes: NameList(222) to the translator. Matching GI numbers will be automatically rebroadcast back to the Firegoose. gaggle: NameList(219) will appear in the Gaggle Data menu (a few genes fail to translate).

Translator

For convenience, the translated list of genes has been embedded in this page and appears as Anaerobic genes as GI numbers: NameList(219) in the Gaggle Data menu.

[+] Show/hide Anaerobic genes as GI numbers.

  1. Broadcast either of gaggle: NameList(219) or Anaerobic genes as GI numbers: NameList(219) to DAVID.

The genes are pasted into the search form of DAVID. We still have to tell DAVID that these are GI numbers.

  1. Select "GI_ACCESSION" in the drop-down labeled Step 2: select identifier.
  2. Click Submit List.
  3. When DAVID has finished processing the uploaded list, press the button marked Functional Annotation Clustering.

DAVID

DAVID groups our proteins into clusters based on functional annotations from several primary sources. We can broadcast a cluster or the subcategories within a cluster back to the Gaggle, but doing so requires another translator, from GI to VNG this time, and a trick.

(optional)

  1. Start the GI->VNG translator.
  2. Click the multicolored Gaggle button on the toolbar to update the list of connected geese.
  3. Back in the DAVID clustering results, find the cluster containing signal transduction annotations.
  4. Click on the red G.
  5. Now the trick; we want to select the first column holding GI_ACCESSION numbers in the table. We can do that by holding down the control key (on windows) or the command key (on OS X) and dragging the mouse over the table cells in that column. Those cells should then be selected.
  6. Right-click on the selected cells and choose Capture Selection. An entry Selection: geneList(4) should appear in the Gaggle Data menu.

Capture Selections

Step 7: View signal transduction protein domains in STRING

We could hunt and find these proteins in the original large STRING network, but viewing them separately would be easier.

  1. Broadcast the Selection: geneList(4) to the GI->VNG translator.
  2. From the translator, broadcast the corresponding VNG numbers back to the Firegoose.
  3. Rebroadcast to STRING. Wait until the list of proteins shows up. Press Continue ->.

Alternatively, broadcast the embedded Signal transducers: NameList(4) to STRING. We should note that although Bop clusters with signal transducers, it's true function is phototrophy.

[+] Show/hide Signal transducers.

STRING protein domains

Now we can explore the domains identified in these signal transducers.

  1. Click on a protein in the STRING network.
  2. Follow the links to find out more about its known domains.

Conclusions

Using the Firegoose, the Gaggle, and several public data sources, we have created an integrated visualization of the cellular processes activated by H. salinarum in response to anoxic environments. The diagram summarizes our results.

Cellular processes activated in anoxic environments

Edges represent several types of evidence for functional association provided by STRING. Yellow filled nodes indicate genes classified by KEGG. Blue outline nodes indicate genes classified by DAVID. Other nodes were characterized by the annotation database or other sources, including PFAM, BLAST, and PDB. 102 genes of unknown function were omitted.

The above network is availabe for viewing and manipulation in Cytoscape. Try selecting a group of nodes and broadcasting to STRING or the Halobacterium annotations database.

String network in Cytoscape (1.x).

Institute for Systems Biology

validate