Tutorial 1: Introduction to WikiPathways and PathVisio
In this first tutorial you will be introduced to WikiPathways and PathVisio. PathVisio uses the complete analysis pathway collection of WikiPathways. The pathways that are tagged as analysis pathways are part of this collection.
Step 1: Find a pathway in WikiPathways
First, go to wikipathways.org and search for Mitochondrial LC-Fatty Acid Beta-Oxidation in the search box (Figure 1.1). Second, narrow the results to the rat Mitochondrial LC-Fatty Acid Beta-Oxidation pathway by selecting the correct species (Figure 1.2). Finally, click on the Mitochondrial LC-Fatty Acid Beta-Oxidation pathway and the pathway will be displayed in full screen (Figure 1.3).
Q1: What is the identifier of the pathway? Hint: Have a look at the web address of the pathway.
|Figure 1.1: Search WikiPathways||Figure 1.2: Select species||Figure 1.3: Display pathway in WikiPathways|
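If you want to pick the identifier out of a web address in a script, a minimal sketch could look like the following. It assumes that WikiPathways identifiers have the form "WP" followed by digits; the address in the example is a hypothetical placeholder, not the address of the pathway in this tutorial.

```python
import re

# WikiPathways identifiers consist of the prefix "WP" followed by digits.
WP_ID_PATTERN = re.compile(r"WP\d+")


def extract_pathway_id(url):
    """Return the first WikiPathways identifier found in url, or None."""
    match = WP_ID_PATTERN.search(url)
    return match.group(0) if match else None


# Hypothetical address, used only for illustration; paste the address
# from your own browser instead.
example_url = "https://www.wikipathways.org/index.php/Pathway:WP0000"
print(extract_pathway_id(example_url))
```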
Step 2: Download a pathway from WikiPathways in gpml format
In WikiPathways you can save pathways in different formats, for example as pdf or png. Another option is gpml, which is the format used by PathVisio. At the bottom of the Mitochondrial LC-Fatty Acid Beta-Oxidation pathway in WikiPathways you will find a download button. Click on this button and save the pathway as gpml (see Figure 2).
|Figure 2: Save pathway in gpml format|
Step 3: Start PathVisio
Note: If you already installed PathVisio as instructed, you can skip step 2a.
- Copy the directory PathVisio-3.1.0 from the provided USB stick onto your laptop.
- If you want to install PathVisio at home, you can download PathVisio from http://www.pathvisio.org/downloads/. (Download binary installation to use PathVisio offline).
- Start PathVisio by executing the pathvisio.bat file (Windows) or the pathvisio.sh file (Linux / MacOSX) in the PathVisio-3.1.0 directory (Fig. 3.1).
- PathVisio will now start all modules (Fig. 3.2), and the PathVisio main window will open (Fig. 3.3).
|Figure 3.1: Start PathVisio||Figure 3.2: PathVisio will start all modules||Figure 3.3: PathVisio opens with an empty pathway view|
Step 4: Select the ID mapping database
The pathway you downloaded is a rat pathway. As mentioned in the morning lecture you need to have an ID mapping database to be able to recognize the genes in the pathway. The rat ID mapping database (=Rn_Derby_20120602.bridge) is available via the USB stick in the pathways directory. To select the gene database in PathVisio go to Data -> Select Gene Database -> Select the Rn_Derby_20120602.bridge file. The selected rat gene database file is now displayed at the bottom panel (see Figure 4).
|Figure 4: Selected rat gene database|
Step 5: Open the downloaded pathway in PathVisio
To open the downloaded pathway go to File -> Open -> Select the downloaded pathway in gpml format. Figure 5 shows the pathway in PathVisio.
|Figure 5: Downloaded pathway opened in PathVisio|
Step 6: Select a gene and study side panel
Click on the Cpt1a gene box in the opened rat Mitochondrial LC-Fatty Acid Beta-Oxidation pathway, see Figure 6.1. The backpage in the panel on the right-hand side shows the annotation of the gene and the available cross references.
Go back to the pathway and double-click on the Cpt1a gene box. A DataNode properties panel now opens, showing the annotation, literature and comments, see Figure 6.2.
Q2: Which identifier and database are used to annotate the Scp2 gene in the rat Mitochondrial LC-Fatty Acid Beta-Oxidation pathway?
|Figure 6.1: Selected Cpt1a gene box + backpage||Figure 6.2: DataNode panel|
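The annotation shown in the DataNode panel is stored in the gpml file itself, which is an XML document. The sketch below lists the label, database and identifier of each gene box. It assumes the GPML layout in which every DataNode element carries a TextLabel attribute and an Xref child with Database and ID attributes. The embedded sample, its gene labels and its identifiers are hypothetical placeholders, not content of the tutorial pathway.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened GPML-like sample for illustration only.
SAMPLE_GPML = """
<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Example pathway">
  <DataNode TextLabel="GeneA" GraphId="a1" Type="GeneProduct">
    <Xref Database="Ensembl" ID="HYPOTHETICAL_ID_1"/>
  </DataNode>
  <DataNode TextLabel="GeneB" GraphId="a2" Type="GeneProduct">
    <Xref Database="Entrez Gene" ID="HYPOTHETICAL_ID_2"/>
  </DataNode>
</Pathway>
"""


def local_name(tag):
    """Strip the XML namespace, so the code works for any GPML version."""
    return tag.rsplit("}", 1)[-1]


def list_datanode_annotations(gpml_text):
    """Return (label, database, identifier) for every annotated DataNode."""
    root = ET.fromstring(gpml_text)
    annotations = []
    for element in root.iter():
        if local_name(element.tag) != "DataNode":
            continue
        for child in element:
            if local_name(child.tag) == "Xref":
                annotations.append(
                    (element.get("TextLabel"), child.get("Database"), child.get("ID"))
                )
    return annotations


for label, database, identifier in list_datanode_annotations(SAMPLE_GPML):
    print(label, database, identifier)
```

To try this on the downloaded pathway, read the gpml file into a string and pass it to list_datanode_annotations.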
Tutorial 2: Data Visualization and Analysis in PathVisio
In this tutorial you are going to perform pathway analysis in PathVisio to help biological interpretation. You are going to:
- Search for regulated pathways that might be relevant to study in more detail.
- Visualize your data on a pathway diagram so you can explore the data in a biological context.
- Dataset description
- Step 1: Pathways and identifier mapping databases
- Step 2: Data file preparation for PathVisio
- Step 3: Import the data into PathVisio
- Step 4: Create a visualization by coloring logFC and p.value
- Step 5: Search for regulated pathways
Transcriptomics data set
The transcriptomics data set is published and the data is available via ArrayExpress, see E-MTAB-797.
A subset of the Toxicogenomics Project, a 5-year collaborative project (2002-2007) by a consortium comprising the Japanese government and several pharmaceutical companies, was selected. This project produced a large-scale database of transcriptomics and pathology data potentially useful for predicting the toxicity of new chemical entities. Conventional in vivo toxicology data was collected from single dose and repeat dosing studies on rats, and gene expression measured for the liver.
Takeki Uehara, Atsushi Ono, Toshiyuki Maruyama, Ikuo Kato, Hiroshi Yamada, Yasuo Ohno, Tetsuro Urushidani. The Japanese toxicogenomics project: application of toxicogenomics. Mol Nutr Food Res: 2010, 54(2);218-27 [PubMed:20041446]
Be aware that this paper describes the Toxicogenomics Project as a whole. The ArrayExpress entry page gives a better description of the experimental setup.
A detailed description of the study design can be found at the website of ArrayExpress, see protocols.
Description of selected transcriptomics samples
Hepatocytes of 6-week-old male Sprague-Dawley rats were treated for 8 hours with 30 micromolar fenofibrate. Fenofibrate is an activator (agonist) of peroxisome proliferator-activated receptor (PPAR) alpha. Fenofibrate was added to the medium directly or as a 1,000X stock solution in DMSO. After compound exposure, the hepatocytes were lysed with RLT buffer and collected for expression profiling.
NOTE: The fenofibrate treatment was part of a large screen of many toxicological compounds. Fenofibrate was given in three different dosages. Here we chose the highest concentration, given for 8 hours.
Total RNA was isolated from the hepatocyte lysate using an RNeasy kit (Qiagen). 10 µg of fragmented cRNA was hybridized to the probe array for 18 h at 45 °C at 60 rpm, after which the array was washed and stained with streptavidin-phycoerythrin using Fluidics Station 400 (Affymetrix) and scanned with the Gene Array Scanner (Affymetrix). The Affymetrix GeneChip Rat Genome 230 2.0 [Rat230_2] was used.
The quality of the Affymetrix microarrays used in the rat experiment was analyzed using arrayanalysis.org, an Affymetrix analysis pipeline developed at the Department of Bioinformatics, Maastricht University. Checking the quality ensures that the downstream analysis is not biased by any (large) technical influences, which in turn may lead to a biased biological outcome. After the QC analysis the gene expression data were normalized using GC-RMA normalization.
The normalized gene expression data was statistically analysed using the limma package in R-Bioconductor. This package uses moderated t- and F-statistics based on linear modelling to perform differential gene expression analysis for data arising from microarray experiments. The main advantage of limma over traditional t- or F-tests is that information is borrowed from other genes to estimate the variances and standard errors of a single gene. This stabilises the analysis, particularly for small sample sizes.
Statistically analyzed data set
The statistically analyzed transcriptomics data set is available on the provided USB stick in the dataset directory as Feno_High_vs_Control.txt.
Open the file using Excel and have a look at the statistically analyzed data. In the data file you will find the following columns:
- Ensembl: this column contains the identifiers of the genes in the data set.
- Syscode: this column specifies the data source of the identifier. In our example data set we are using En for Ensembl. This column is optional if all the identifiers are from the same database.
- logFC: the fold change is a metric for comparing an expression level between two distinct experimental conditions. Log transformed data is easier to handle statistically. Here we compared high-fenofibrate-treated versus control.
- p.value: statistical significance
- p.value.adj: corrected p-value for multiple testing
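To get a feeling for these columns outside Excel, the following sketch reads a tab-separated table with the same column names and converts each logFC back to a plain fold change. It assumes that logFC is a base-2 logarithm, which is the usual convention for limma output but is not stated in the data file itself. The two rows are made-up values with hypothetical identifiers, not rows from Feno_High_vs_Control.txt.

```python
import csv
import io

# Made-up rows with the same column layout as the tutorial data file.
SAMPLE_DATA = (
    "Ensembl\tSyscode\tlogFC\tp.value\tp.value.adj\n"
    "GENE_A\tEn\t1.0\t0.001\t0.01\n"
    "GENE_B\tEn\t-2.0\t0.2\t0.4\n"
)


def read_rows(handle):
    """Read a tab-separated table into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def fold_change(log_fc, base=2.0):
    """Convert a log fold change back to a ratio (base 2 is an assumption)."""
    return base ** log_fc


rows = read_rows(io.StringIO(SAMPLE_DATA))
for row in rows:
    print(row["Ensembl"], row["logFC"], fold_change(float(row["logFC"])))
```

Under this assumption a logFC of 1.0 corresponds to a doubling and a logFC of -2.0 to a quarter of the control level. For the real file, replace the io.StringIO object with open("Feno_High_vs_Control.txt", newline="").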
Step 1: Pathways and identifier mapping databases
In addition to the experimental data file, you need two other types of files to use PathVisio:
- Pathways: A set of pathway files in GPML format (*.gpml files)
- Rat identifier mapping database: A species-specific identifier mapping database so PathVisio can take care of the identifier mapping step.
Note: For this workshop, we prepared USB sticks containing all the data files that you need for this analysis. If you want to repeat the analysis at home you can download the data from the following websites (you can also find the data for other species there):
- Pathways: You can find them on the USB stick in the directory pathway-analysis/pathways-rno-2013-08-28. They have been downloaded from WikiPathways. You can find pathways for different species there.
- Identifier mapping database: You can find the mapping database for rat on the USB stick (pathway-analysis/Rn_Derby_20120602.bridge). It has been downloaded from the BridgeDb website.
Step 2: Data file preparation for PathVisio
PathVisio can load any type of quantitative data (expression values, fold changes, p values, confidence scores,…) or textual data if required. The data has to be saved as a tab separated file (.txt or .csv).
We already pre-processed the dataset described in the dataset description above and provide a file containing the Ensembl identifier, the system code, the log fold change, the p value and the adjusted p value for the comparison high dose vs. control in the liver samples.
You can copy the data set from the USB stick or download it from here.
If you have your own data set and want to prepare it for the import in PathVisio, open the file with Excel and save it as a CSV (Comma separated) file.
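If your own data is comma-separated and you would rather work with a tab-separated file like the one used in this tutorial, the conversion can also be scripted. The sketch below is one possible way to do this with Python's csv module; the column names and the single data row are placeholders.

```python
import csv
import io


def csv_to_tsv(source, target):
    """Copy comma-separated rows from source to target as tab-separated rows."""
    reader = csv.reader(source)
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# In-memory example with placeholder values.
source = io.StringIO("Ensembl,Syscode,logFC\nGENE_A,En,1.5\n")
target = io.StringIO()
csv_to_tsv(source, target)
print(target.getvalue())
```

With real files, open both with newline="" (for example open("mydata.csv", newline="") and open("mydata.txt", "w", newline="")), where the file names are your own.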
Step 3: Import the data into PathVisio
- In the menu bar of PathVisio, click Data → Import expression data (Fig. 3a)
- Use the Browse buttons to locate the following files (Fig. 3b):
- Input file: The experiment data file (Feno_High_vs_Control.txt). Make sure that you have a local copy on the hard-drive (don’t use the file on the USB directly).
- Output file: Will be filled in automatically after selecting the input file, you don’t need to change this.
- Gene database: Use the identifier mapping database for rat (Rn_Derby_20120602.bridge).
- Click “Next”.
- Make sure that tab is selected, because the columns in our data are delimited by tabs. Check that the preview looks as you would expect (Fig. 3c).
- Click “Next”.
- Select the columns that contain the gene identifiers and identifier type. All identifiers in our data set come from the same database, so you can select “Use the same system code for all rows“. Please select Ensembl and NOT Ensembl Rat (Fig. 3d). Alternatively, you can use the Syscode column.
- Click “Next”.
- The data will now be imported into an expression dataset that is saved as a .pgex file on your hard disk. Any exceptions will be reported to the file .pgex.ex. No exceptions should occur for our dataset (Fig. 3e).
- Click “Finish”.
- Note: An exception about old Ensembl identifiers might pop up. Please ignore this warning (Fig. 3f).
- In the status bar at the bottom of PathVisio you can see which identifier mapping database and which data set are loaded (see Fig. 3g).
Step 4: Create a visualization by coloring logFC and p.value
Before we start with the pathway statistics to find changed pathways, we are going to specify how the data should be visualized on the pathways. We are going to test this with the Mitochondrial LC-Fatty Acid Beta-Oxidation pathway.
Tip: PathVisio allows you to change the default values for several settings (see Edit → Preferences → Display → Colors). In this visualization example, we changed the “Criteria not met” color to red.
The data set contains values for log fold change and p.value. We are going to visualize those two values together on the gene nodes in the pathway.
- Go to Data → Visualization Options
- Create a new visualization by clicking the button in the top-right corner and select “New” (Fig. 4a).
- Specify a name for the visualization (e.g. “pathway-tutorial”) (Fig. 4b)
- Check the box in front of “Expression as color” and the box in front of “Text label” (Fig. 4c).
- In the expression as color panel, select Advanced. Then select the logFC column and create a new visualization (Fig. 4d).
- For the logFC it makes sense to use a gradient from -2 to 2. Choose a gradient from blue to yellow (blue being under-expressed, yellow being over-expressed, Fig 4e). Click Ok.
- Select the p.value column and create a new visualization (Fig. 4f). For the p.value we will define a color rule ([p.value] < 0.05), see Fig 4g. Click on new color set, then click on “Add Rule” and specify the rule logic and color. Then press “Ok”.
- The pathway elements are now split into two columns. The first column shows the logFC gradient, while the second column indicates whether a measurement was significant (p.value < 0.05). In the legend tab on the right side, you can see which column in the pathway element represents what, see Fig 4h.
Open the Mitochondrial LC-Fatty Acid Beta-Oxidation pathway from the Rat pathways on the USB stick.
The pathway will now look somewhat like Fig. 4h.
Q3: Which two genes in the rat Mitochondrial LC-Fatty Acid Beta-Oxidation pathway have a high log fold change and are significantly changed?
Tip: To save the pathway with the data visualization, click on File -> Export. Here you can save the pathway in different formats so you can use it in presentations, like *.png.
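To make the two visualization settings of this step concrete, here is a small sketch of the underlying logic: a linear color gradient for logFC between -2 and 2, and a yes/no rule for p.value < 0.05. It only illustrates the idea and is not PathVisio's own rendering code. The exact RGB values for blue and yellow and the purely two-point interpolation are assumptions.

```python
BLUE = (0, 0, 255)      # assumed RGB value for the low end of the gradient
YELLOW = (255, 255, 0)  # assumed RGB value for the high end of the gradient


def gradient_color(value, low=-2.0, high=2.0, low_color=BLUE, high_color=YELLOW):
    """Linearly interpolate an RGB color; values outside [low, high] are clamped."""
    clamped = max(low, min(high, value))
    fraction = (clamped - low) / (high - low)
    return tuple(
        round(start + fraction * (end - start))
        for start, end in zip(low_color, high_color)
    )


def meets_rule(p_value, threshold=0.05):
    """Color rule from the tutorial: [p.value] < 0.05."""
    return p_value < threshold


print(gradient_color(-2.0), gradient_color(2.0), meets_rule(0.01))
```

Clamping means that a logFC of 3.5 gets the same color as a logFC of 2, which is why very strongly changed genes cannot be told apart by color alone.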
Step 5: Search for regulated pathways
In the final step of this tutorial we are going to find out which pathways are enriched with regulated genes. We can then study these pathways and, for example, see whether they are influenced by the compound used in the treatment. These pathways might provide leads for further investigation of the biological implications.
To identify regulated pathways, we are going to use PathVisio to calculate a z-score for each pathway.
- Go to “Data → Statistics” (Fig. 5a).
- The “Pathway Statistics” dialog will open (Fig. 5b)
- In the text field below “Expression:”, type “([logFC] < -1 OR [logFC] > 1) AND [p.value] < 0.05” (without the quotes). This expression defines which genes are significantly changed (up or down) in gene expression in the high dose treated animals.
- In the text field below “Pathway Directory:”, fill in the directory where the pathway (gpml) files are located (see step 1). You can also use the “Browse” button to locate and select the directory.
- Click the “Calculate” button. You should see a progress dialog titled “Calculate Z-scores”.
- After a few minutes, the analysis should be finished and a list of pathways will appear in the dialog (Fig. 5b).
- If you click on a pathway in the list, it will be opened. You can then apply the visualization created in the previous section to study the gene expression profiles and find out if any of the genes were changed in the data set, see Fig 5c.
- Save the list of pathways by clicking on the “Save results” button. You can then open the statistical results in Excel.
Note: Please be aware that the results can be slightly different due to recent changes in the pathway collection.
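The calculation behind this step can be sketched in a few lines. The first function is a Python version of the criterion typed into the “Expression:” field. The second computes a z-score of the kind commonly used for pathway over-representation, based on the hypergeometric distribution: z = (r - n*R/N) / sqrt(n * (R/N) * (1 - R/N) * (1 - (n-1)/(N-1))). That PathVisio's statistic has exactly this form is an assumption here, and the numbers in the example are made up.

```python
import math


def is_changed(log_fc, p_value):
    """Python form of ([logFC] < -1 OR [logFC] > 1) AND [p.value] < 0.05."""
    return (log_fc < -1 or log_fc > 1) and p_value < 0.05


def z_score(n, r, N, R):
    """
    n: measured genes in the pathway    r: genes in the pathway meeting the criterion
    N: measured genes in the data set   R: genes in the data set meeting the criterion
    """
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)


# Made-up counts: 15 of 50 pathway genes changed, against 100 of 1000 overall.
print(is_changed(1.4, 0.01), is_changed(1.4, 0.2))
print(z_score(n=50, r=15, N=1000, R=100))
```

A pathway in which the number of changed genes equals the number expected by chance gets a z-score of 0. The more changed genes it contains beyond that expectation, the higher its z-score and the higher it appears in the result list.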
Q4: Have a close look at the highest ranked pathways. Are these in line with what you expect based on the known effects of PPARalpha activation?
|Fig 5a: Open statistics dialog.||Fig 5b: Define all settings and run statistics.||Fig 5c: Click a row in the result list to open the pathway.|
Optional Tutorials: Design your own pathway / Workflow integration
Optional 1: Design your own pathway → Learn how to draw a pathway.
If you finished the first part and still have time left, please continue with this tutorial.
WikiPathways was established to facilitate the contribution and maintenance of pathway information by the biology community. WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. WikiPathways thus presents a new model for pathway databases that enhances and complements ongoing efforts, such as KEGG, Reactome and Pathway Commons. Building on the same MediaWiki software that powers Wikipedia, we added a custom graphical pathway editing tool and integrated databases covering major gene, protein, and small-molecule systems. The familiar web-based format of WikiPathways greatly reduces the barrier to participate in pathway curation. More importantly, the open, public approach of WikiPathways allows for broader participation by the entire community, ranging from students to senior experts in each field. This approach also shifts the bulk of peer review, editorial curation, and maintenance to the community.
We are using the circadian clock pathway as an example in the tutorial. Please follow the 11 steps on the tutorials page.
Optional 2: Rerun the analysis from R, Perl or Python using PathVisioRPC.
If you have some programming experience, you can rerun the analysis that we just performed in PathVisio from any programming language that supports XMLRPC.
We are using Python as an example here. It is usually pre-installed on Linux and MacOSX. On Windows, install Python 3 by double-clicking the python-3.3.2.msi installer on the USB drive (or download it here) and following the on-screen instructions. The XMLRPC module is included with Python.
Please note: the XMLRPC library is named xmlrpc.client in Python 3, as opposed to xmlrpclib in Python 2. Change your code as necessary.
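If the same script has to run under both Python versions, one common pattern is to try the Python 3 name first and fall back to the Python 2 name. The alias xmlrpc_client is a name chosen for this sketch.

```python
try:
    # Python 3 name of the XML-RPC client library.
    import xmlrpc.client as xmlrpc_client
except ImportError:
    # Python 2 name of the same library.
    import xmlrpclib as xmlrpc_client

# ServerProxy is available under either name.
print(xmlrpc_client.ServerProxy)
```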
- You can either use the directory pathway-analysis/pathvisio-rpc on the USB stick or download the zip file containing all necessary files here (59 MB). Remember to extract the files after downloading the zip file.
- First you need to start the PathVisioRPC server. The executable jar file PathVisioRPC-standalone.jar that you downloaded in the previous step launches the PathVisioRPC server on your local computer.
Open a terminal and use cd to change the current working directory to the folder that you downloaded and unzipped in the previous step. Then type java -jar PathVisioRPC-standalone.jar to start the PathVisioRPC server on port 7777. Leave this terminal open while running the script; the server output for each request will be shown here.
- Now run the Python script from the command line with python PathVisioRPC-Python.py. Windows users: go to the ncsb-workshop-pathvisio-rpc folder and double-click the PathVisioRPC-Python-Windows.py script to execute it. The script will run for a while and produce results. If you want to redo the analysis with another data set, you will need to change the file path and visualization settings in this file. For this tutorial it is important that all the files are present in the same directory, so that the relative file locations work. The script uses the following functions:
server.PathVisio.importData(...): performs the data import step and creates a pgex file in the result directory.
server.PathVisio.createVisualization(...): specifies the gradient and color rule that we used during the tutorial.
server.PathVisio.calculatePathwayStatistics(...): calculates and exports the z-score statistics results as HTML pages.
- Go to the results directory and open the index.html page. It provides an overview of the pathway analysis, and you can click on a pathway in the list to show the pathway diagram. The nodes in the pathway are clickable, and the backpage will be opened in another tab.
Tip: PathVisioRPC allows you to include PathVisio in your workflow. You can run multiple analyses a lot faster than by doing them by hand.
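When you build your own workflow, it helps to check that the PathVisioRPC server is actually running before sending requests. The sketch below tests whether something is listening on port 7777 and then creates an XML-RPC proxy. The address http://localhost:7777 is an assumption based on the port mentioned above. The arguments of importData, createVisualization and calculatePathwayStatistics are defined in the provided script and are deliberately not reproduced here.

```python
import socket
import xmlrpc.client

HOST = "localhost"
PORT = 7777  # port on which the tutorial starts the PathVisioRPC server


def server_is_listening(host=HOST, port=PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def connect(host=HOST, port=PORT):
    """Create the XML-RPC proxy; no request is sent until a method is called."""
    return xmlrpc.client.ServerProxy("http://{}:{}".format(host, port))


if server_is_listening():
    server = connect()
    # From here on, call server.PathVisio.importData(...) and the other
    # functions with the arguments used in the provided script.
    print("PathVisioRPC server reachable on port", PORT)
else:
    print("No server on port", PORT, "- start PathVisioRPC-standalone.jar first")
```

Wrapping the three calls from the provided script in a loop over several data files is then a natural next step for batch analyses.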