- Annotation types
- Understanding the graph
- User properties
- Searching the graph
- Biological examples
- Technical remarks
- Referencing PANDORA
Proteomic and genomic research often deals with large protein sets. These can be results from microarray experiments or comparative proteomics, computationally derived protein families, or even BLAST hit lists. Biological interpretation of such sets is time-consuming and requires intimate knowledge of each protein. Furthermore, it is difficult to gain a global view of the protein set and to detect biologically significant subsets. PANDORA was developed to allow in-depth biological analysis of such large protein sets. This is achieved through annotation analysis, built on two main ideas: representing all protein-keyword relations with a Concept DAG (Directed Acyclic Graph), and integrating several annotation sources that cover different biological aspects, such as function, 3D structure, cellular localization, taxonomy and participation in biological processes. PANDORA is based on the proteins that appear in the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases.
2. Annotation types
We define an annotation (or a keyword) as a binary property that may be assigned to a protein, out of a library of properties (annotation sources).
What this means is that each protein may either have or not have a given annotation.
PANDORA is based on the UniProt protein database (an integration of the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases). In the version that is currently used, our database includes 3,188,835 proteins. Several annotation sources were used to annotate the UniProt protein database. PANDORA currently supports the following annotation sources:
|Annotation source (version)||Number of annotations||Annotation type||Data structure|
|UniProtKB/Swiss-Prot (15.4)||949||a wide range of annotations from very general to very specific.||unstructured.|
|InterPro (21.0)||18,638||sequential-based motifs. 6 categories: 'family', 'domain', 'repeat', 'PTM', 'active site' and 'binding site'.||partly structured as small trees.|
|GO* (Gene Ontology) (1.7)||27,050||3 categories: 'molecular function', 'cellular component' and 'biological process'.||each category structured as a DAG.|
|ENZYME (June 2009)||5,190||enzymatic functional annotations.||tree structure, 4 levels (corresponding to the 4 numbers of the EC entry): 'class', 'subclass', 'sub-subclass' and 'enzyme entry'.|
|NCBI Taxonomy (June 2009)||442,867||taxonomical annotations.||tree structure.|
|SCOP (1.75)||7,821||structural annotations.||tree structure, 4 levels: 'class', 'fold', 'superfamily' and 'family'.|
* Please note that in our implementation of GO annotation assignment, each protein is assigned not only the annotations given to it by the EBI mapping, but also all the annotations that are parents of those annotations in the GO hierarchy.
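The parent propagation described in the note above can be sketched as follows. This is a minimal illustration, not PANDORA's code: it assumes that "parents" means all ancestors up to the root, and the function names and the simplified toy hierarchy are hypothetical.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` in a DAG given as child -> direct parents."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(direct: Dict[str, Set[str]],
              parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Expand each protein's direct annotations with all of their ancestors."""
    expanded: Dict[str, Set[str]] = {}
    for protein, terms in direct.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded


# Simplified toy hierarchy and hypothetical protein IDs (illustration only).
toy_parents = {
    "kinase activity": {"transferase activity"},
    "transferase activity": {"catalytic activity"},
    "catalytic activity": {"molecular function"},
}
toy_direct = {"P1": {"kinase activity"}, "P2": {"catalytic activity"}}

expanded = propagate(toy_direct, toy_parents)
print(sorted(expanded["P1"]))
```

In this toy case, P1 is annotated directly only with "kinase activity" but ends up carrying every term on the path to the root.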
Using annotation sources cleverly will allow you to extract rich biological information. View your protein sets through different annotation sources to learn about different aspects of the set. You may also choose to see annotations from multiple annotation sources simultaneously (although this should be used carefully so that graphs don't become too complex). To select multiple types, simply hold down 'ctrl' while clicking on the annotation type. Choosing all levels of the taxonomy/ENZYME/SCOP annotations will give you the taxonomical/enzymatic/structural tree for your protein set. Educated use of zooming combined with multiple annotation types can be a powerful tool, as can be appreciated from some biological examples.
3.1 User set: This option allows the user to upload a file containing a list of proteins with or without additional properties. You may also select a specific background database from which the evaluation statistics will be calculated. If your file does not contain additional properties, just supply a list of protein IDs, separated by space / tab / newline / comma / semicolon. Note that PANDORA now supports GenBank gene accession numbers as well as UniProtKB/Swiss-Prot and UniProtKB/TrEMBL protein accession numbers.
If your file contains additional properties, you should use the following format:
- The file should be constructed from columns. The first column should contain accession numbers, and the next columns should contain the properties. Columns can be separated by tab ("tab delimited format") or by a comma ("CSV format").
- The first row should contain the type of each column. The first column, containing the accession IDs, should have type "a", and each following column should be marked with its type: binary property ("b"), multiple binary properties ("m") or quantitative property ("q").
- The second row should contain the name of the data in the column.
- The third row and on should contain data.
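The file layout described above can be read with a short parser like the one below. It is a sketch under stated assumptions: the function name `parse_user_set`, the accession numbers, the column names and the 1/0 encoding of the binary column are all hypothetical, and "b" and "m" cells are kept as raw strings because their exact encoding is not specified here.

```python
import csv
import io
from typing import Dict, List, Tuple


def parse_user_set(text: str) -> Tuple[List[Tuple[str, str]], Dict[str, list]]:
    """Parse a typed table: row 1 = column types, row 2 = names, row 3+ = data."""
    # Tab-delimited if the type row contains a tab, otherwise CSV.
    delimiter = "\t" if "\t" in text.splitlines()[0] else ","
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    types = [cell.strip() for cell in rows[0]]
    names = [cell.strip() for cell in rows[1]]
    if types[0] != "a":
        raise ValueError("first column must have type 'a' (accession IDs)")
    for col_type in types[1:]:
        if col_type not in {"b", "m", "q"}:
            raise ValueError(f"unknown column type: {col_type!r}")
    columns = list(zip(types[1:], names[1:]))
    proteins: Dict[str, list] = {}
    for row in rows[2:]:
        values = []
        for (col_type, _name), cell in zip(columns, row[1:]):
            cell = cell.strip()
            if col_type == "q":
                # Quantitative cells become floats; empty cells mean "no value".
                values.append(float(cell) if cell else None)
            else:
                values.append(cell or None)
        proteins[row[0].strip()] = values
    return columns, proteins


# Hypothetical tab-delimited input with one binary and one quantitative column.
example = (
    "a\tb\tq\n"
    "Accession\tMembrane\tFold change\n"
    "P12345\t1\t2.5\n"
    "Q67890\t0\t-1.2\n"
)
columns, proteins = parse_user_set(example)
print(columns)
print(proteins)
```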
3.2 Keyword: For research on preselected protein sets, PANDORA provides the ability to choose a set of proteins containing a keyword of interest to you. Simply enter the annotation or a part of it, and you will be supplied with a list of annotations that match. Choosing one or more keywords will fetch all proteins that have ANY of the selected annotations (a union of the protein sets of all chosen keywords).
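The keyword selection above amounts to a partial-text match followed by a set union. The sketch below illustrates this on a toy index. The function names, the case-insensitive substring matching and the protein IDs are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Set


def matching_annotations(query: str, index: Dict[str, Set[str]]) -> List[str]:
    """List annotations whose name contains the query (case-insensitive)."""
    needle = query.lower()
    return sorted(name for name in index if needle in name.lower())


def proteins_with_any(selected: Iterable[str], index: Dict[str, Set[str]]) -> Set[str]:
    """Union of the protein sets of all selected annotations."""
    result: Set[str] = set()
    for name in selected:
        result |= index.get(name, set())
    return result


# Toy annotation -> proteins index with hypothetical protein IDs.
toy_index = {
    "Hydrolase": {"P1", "P2"},
    "Glycosyl hydrolase": {"P2", "P3"},
    "Kinase": {"P4"},
}
hits = matching_annotations("hydrolase", toy_index)
print(hits)
print(sorted(proteins_with_any(hits, toy_index)))
```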
3.3 BLAST: BLAST searches often produce long lists of matching proteins for your query sequence. Extracting biological information from this list can be a difficult task, and people usually just go through the names of the matching proteins or look only at the first few matches. The BLAST feature in PANDORA lets you run an NCBI-BLAST search and view the results in PANDORA, using the matching proteins as the basic protein set and the matching E-values as a quantitative property on them. This makes it very easy to explore the results from various biological aspects and to detect biological groups with significant E-values. You can then go on to align them by opening the protein list (click on the top node) and sending the proteins directly to CLUSTALW for multiple sequence alignment.
3.4 Peptides: Peptides heavier than 600 daltons are supported. Currently, peptides from the rat, mouse, human, Drosophila and yeast proteomes are fully supported. Peptides that result from cleavage by trypsin and other common proteases are included. Currently, only complete cleavage is supported. Peptides are submitted in their unmodified form.
4. Understanding the graph
PANDORA displays a Directed Acyclic Graph (DAG) summarizing the annotations given to a set of proteins.
4.1 Construction of the graph: You can think of the annotations on your protein set as a binary matrix where the rows are annotations and the columns are your proteins (Figure 1). Each row describes a subset of proteins that share a certain biological property. Each of these subsets is a basic node of the graph. PANDORA compares these nodes and constructs a hierarchical graph from them. Each node represents a set of proteins that were all assigned a common annotation. When comparing two nodes there are four possible cases:
- Sets are equal: Nodes will be merged into one set of proteins. The merged node is assigned the annotations of both original nodes.
- One set is a subset of the other set: An edge will be created between the two nodes, and the subset will be placed beneath.
- Sets intersect (excluding the previous case): A new node will be created containing the intersection of the two sets. The annotation that will be assigned to this node will be that of both parents. Edges will be created from the two nodes to the new node, and it will be placed beneath them.
- Sets are disjoint: Leave as two separate nodes.
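The comparison cases listed above map directly onto set operations. The following sketch classifies a pair of nodes, each given as a set of protein IDs. The function name and return format are hypothetical, and the sketch covers only the pairwise comparison, not the construction of the whole graph.

```python
from typing import FrozenSet, Optional, Tuple


def compare_nodes(a: FrozenSet[str],
                  b: FrozenSet[str]) -> Tuple[str, Optional[FrozenSet[str]]]:
    """Classify the relation between two protein sets.

    Returns the case name and the set that the case produces:
    the merged set, the subset placed beneath, the new intersection
    node, or None when the sets are disjoint.
    """
    if a == b:
        return "equal", a
    if a < b:
        return "subset", a
    if b < a:
        return "subset", b
    common = a & b
    if common:
        return "intersect", common
    return "disjoint", None


# Hypothetical protein IDs.
kinases = frozenset({"P1", "P2", "P3"})
membrane = frozenset({"P3", "P4"})
tyrosine_kinases = frozenset({"P1", "P2"})
nuclear = frozenset({"P5"})

print(compare_nodes(kinases, tyrosine_kinases))
print(compare_nodes(kinases, membrane))
print(compare_nodes(kinases, nuclear))
```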
4.2 The graph: Now let us consider the final graph (Figure 2). Nodes in the graph (shown as red and white balls) represent sets of proteins that share a unique combination of annotations. Their size reflects the number of proteins in them. To see the annotations given to the proteins of a node, move the mouse pointer over the node. The edges (shown as green lines) represent subset/superset relations between the nodes, with a top-to-bottom directionality. This means that if node A is connected to node B which is beneath it, A is a superset of B. This provides a simple yet important rule to follow: every protein in a node has the node's annotations and the annotations of ALL ITS ANCESTORS in the graph. The node at the top of the graph represents all the proteins of your set, even if the proteins do not share any annotation (in this case it will be marked as "BS" - Basic Set). Clicking this node will open a window that lists the proteins of this set (Figure 3). Clicking any other node will show the proteins of that node as a new graph (see "zooming"). To view the protein list of any other node in the graph, first click on it, and then click on the top node in the new window that opens.
4.3 Graph Evaluation
In some cases assessing the biological "quality" or significance of your protein set can be insightful.
To understand this concept, let us say you see that your protein set of 10 proteins all share a certain annotation.
How significant is this biologically?
Well, this depends on what annotation they share: obviously, if they share an annotation that is very rare and appears only 10 times in the database, it would be considered very significant.
On the other hand, if the annotation is highly abundant in the database (for example the annotation "enzyme"), it may be less interesting biologically.
In order to deal with this issue PANDORA offers two evaluation methods:
4.3.1 Evaluation Table
The evaluation table gives different measures for evaluating the appearance of annotations in your protein set. Each row in the table represents an annotation. For each annotation, four measures are provided: sensitivity, specificity, p-value and corrected p-value.
For a given annotation a, let P be the number of proteins in your set, N the number of proteins in your set that have the annotation a, D the number of proteins in the database and K the number of proteins in the database that have annotation a. Because the statistics are calculated according to a given database, it is important to select the appropriate background database when you are studying your own set of proteins.
Sensitivity is defined as: Sensitivity(a) = N/K. Sensitivity measures what fraction of the proteins with annotation a in the background database are in your set (also known as "recall").
Specificity is defined as: Specificity(a) = N/P. Specificity measures what fraction of your protein set has annotation a (also known as "accuracy" or "precision").
The P-value is defined as: P-value(a) =
In order to calculate the P-value efficiently PANDORA uses a very good approximation of the binomial coefficient (published by Stanica P, JIPAM vol 2 article 30):
P-value represents the probability of finding N or more proteins that have the annotation a by chance, given P, D and K.
The FDR-corrected p-value is defined as the p-value multiplied by the total number of annotations for the proteins of your set, divided by the rank of this p-value (its index when all the p-values are sorted in ascending order). This is known as the Benjamini and Hochberg FDR correction (1995). Correcting the p-value is necessary because PANDORA is actually considering multiple hypotheses by testing the p-values of all annotations that are assigned to your set.
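The four measures can be sketched in Python as follows. This is an illustration rather than PANDORA's implementation. It assumes that "N or more proteins by chance, given P, D and K" is a hypergeometric tail (drawing P proteins without replacement from a database of D proteins, K of which carry the annotation). It uses exact binomial coefficients from `math.comb` instead of the approximation mentioned above, and it applies the correction exactly as worded, without capping values at 1. All function names are hypothetical.

```python
from math import comb
from typing import Dict


def sensitivity(N: int, K: int) -> float:
    """Fraction of database proteins with the annotation that are in the set."""
    return N / K


def specificity(N: int, P: int) -> float:
    """Fraction of the set that has the annotation."""
    return N / P


def p_value(N: int, P: int, K: int, D: int) -> float:
    """Assumed hypergeometric tail: chance of N or more annotated proteins.

    sum over i = N .. min(P, K) of C(K, i) * C(D - K, P - i) / C(D, P)
    """
    favourable = sum(comb(K, i) * comb(D - K, P - i)
                     for i in range(N, min(P, K) + 1))
    return favourable / comb(D, P)


def fdr_corrected(p_values: Dict[str, float]) -> Dict[str, float]:
    """p * (number of annotations) / rank, with rank 1 for the smallest p-value."""
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    return {name: p * m / rank
            for rank, (name, p) in enumerate(ranked, start=1)}


# Toy numbers: database of 20 proteins, 5 annotated; set of 4 proteins, 2 annotated.
print(sensitivity(2, 5), specificity(2, 4), p_value(2, 4, 5, 20))
print(fdr_corrected({"a": 0.01, "b": 0.04, "c": 0.03}))
```

For realistic database sizes the exact coefficients become very large integers, which is presumably why an approximation is used in practice.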
4.3.2 Background databases
When evaluating the significance of results, it is critical to know the background distribution from which the samples were taken. In the case of PANDORA evaluation, it is essential to know from what "pool" of proteins your protein set was taken. PANDORA offers 3 types of background databases to select from:
- UniProtKB/Swiss-Prot | UniProtKB/TrEMBL: All the proteins in the PANDORA database.
- Species specific: All proteins that belong to a specific species. You can either select a species or let PANDORA detect the species automatically.
- Microarray: All proteins of a specific microarray. We currently offer a selection of popular Affymetrix microarrays. If you have requests for other microarray backgrounds, let me know.
4.3.3 Node colors
Another way to evaluate your protein set is by studying the node colors. Recall that each node consists of a subset of proteins that share a group of annotations. For each node and annotation we can calculate the node's sensitivity for that annotation (as defined above). The node's color represents the highest sensitivity of the node to any of its annotations (e.g. if a node has sensitivity values of 0.11, 0.32 and 0.87 for its three annotations, the color would represent the 0.87 sensitivity). The whiter the node is, the higher its sensitivity (a completely white node has a sensitivity of around 1, and a completely red node has a sensitivity of around 0). You can see the exact number of appearances in the database of every annotation in the tooltip which opens from each node.
For some nodes the sensitivity is not well-defined and these nodes appear as a red-white swirl (undetermined sensitivity).
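The coloring rule above can be sketched as follows. Since every protein in a node carries all of the node's annotations, the node's sensitivity for an annotation is its size divided by the annotation's count in the database. The linear red-to-white RGB mapping and the function names are assumptions made for illustration. The actual color scale is not specified here.

```python
from typing import Dict, Optional, Tuple


def node_sensitivity(node_size: int, db_counts: Dict[str, int]) -> Optional[float]:
    """Highest sensitivity (node size / database count) over the node's annotations.

    Returns None when no sensitivity can be computed (undetermined).
    """
    values = [node_size / count for count in db_counts.values() if count > 0]
    return max(values) if values else None


def node_color(value: Optional[float]) -> Optional[Tuple[int, int, int]]:
    """Map sensitivity 0..1 to an RGB color from red (0) to white (1)."""
    if value is None:
        return None  # would be drawn as the red-white swirl
    clamped = min(max(value, 0.0), 1.0)
    level = round(255 * clamped)
    return (255, level, level)


# A node of 87 proteins whose three annotations (hypothetical names)
# appear 800, 270 and 100 times in the database.
best = node_sensitivity(87, {"enzyme": 800, "hydrolase": 270, "rare motif": 100})
print(best, node_color(best))
```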
4.4 Quantitative properties on the graph
When using quantitative properties (see section about user properties), you will see small bar-shaped color histograms next to each node (Figure 4). Recall that each node represents a protein set, therefore the histogram shows the distribution of the quantitative properties on its proteins. You can point the mouse over the color histograms to see graph histograms of the distribution. On the upper left corner you will see a color legend, with tooltips showing the value range represented by each color.
A white line across each color bar shows the fraction of proteins in the node which have values for the quantitative property. This is important because the color histogram alone may be misleading: for example, if in a node of 10 proteins there are quantitative values for only one protein, then the histogram would show only the data for this protein, misleading the user to believe that all proteins of the node behave similarly. The higher the white line appears on the bar, the more of its proteins are represented by the histogram. If the line appears at the top end of the bar, all proteins in the node have values for the quantitative property and are represented in the histogram.
In certain cases, you may wish to change the value range for the histograms. For example, if most of your property values are between -2 and 2, and only one value is 20, the color histogram will range from -2 to 20, and most of the nodes will have the same color, making the histogram uninformative. In this case you would want to manually set the histogram range, for example from -2 to 2.
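The two pieces of information shown next to a node, the distribution of values and the fraction of proteins that have a value, can be computed as in the sketch below. The number of bins, the function name and the choice to clamp out-of-range values into the end bins are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def node_histogram(values: Sequence[Optional[float]],
                   low: float, high: float,
                   bins: int = 4) -> Tuple[List[int], float]:
    """Return (bin counts, coverage) for the quantitative values of one node.

    `values` holds one entry per protein, None where no value exists.
    Coverage corresponds to the height of the white line on the bar.
    """
    present = [v for v in values if v is not None]
    coverage = len(present) / len(values) if values else 0.0
    counts = [0] * bins
    width = (high - low) / bins
    for v in present:
        clamped = min(max(v, low), high)  # out-of-range values go to the end bins
        index = min(int((clamped - low) / width), bins - 1)
        counts[index] += 1
    return counts, coverage


# Six proteins, one without a value and one outlier (20.0), range set to -2..2.
counts, coverage = node_histogram([-1.5, -0.5, 0.5, 1.5, 20.0, None], -2.0, 2.0)
print(counts, coverage)
```

With the range fixed at -2 to 2, the outlier no longer stretches the scale, and the coverage value shows that one of the six proteins is not represented.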
5. Resolution
In many cases, the graph of annotations on a set of proteins can be very complex and hard to decipher. Furthermore, such graphs make it hard to gain a global view of the data. Resolution helps deal with complex graphs. The main idea is to simplify the graph so that you can look at a "rough draft" of the graph and then focus only on subgraphs that interest you. Conceptually, resolution is a threshold that trades the accuracy of the graph for simplicity.
The resolution (R) is given in number of proteins, but can also be entered as a number 0 < X < 1, which automatically sets the resolution to R = X * size of set. The graph will be simplified so that its accuracy remains within R proteins. As the value of R increases, we trade off graph accuracy for simplicity. For example, at a resolution of 2 proteins, two nodes (each representing a protein group) that differ in 2 proteins will be considered equal and merged into one node. So the graph becomes simpler, but we lose the "fine details". When R=0, all protein-keyword relations are displayed (no simplification). At higher resolutions some of the data is lost so that the graph can be simplified. Changing the resolution lets you control the amount of data that is lost.
The default resolution is 0.01, meaning the number of proteins in the set divided by 100. To change the resolution of a graph, type in the resolution and click refresh.
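The effect of the resolution on node merging can be illustrated with the sketch below. It assumes that "differ in R proteins" is measured by the size of the symmetric difference between two protein sets, and that merged nodes keep the union of their proteins. The greedy grouping and the function names are hypothetical and are not a description of PANDORA's actual simplification algorithm.

```python
from typing import Dict, FrozenSet, List, Tuple


def effective_resolution(r: float, set_size: int) -> float:
    """Treat 0 < r < 1 as a fraction of the set size, otherwise as a protein count."""
    return r * set_size if 0 < r < 1 else r


def merge_at_resolution(nodes: Dict[str, FrozenSet[str]],
                        r: float) -> List[Tuple[List[str], FrozenSet[str]]]:
    """Greedily merge nodes that differ from a group's first member by at most r proteins."""
    groups: list = []  # each entry: [labels, representative set, merged set]
    for label, proteins in nodes.items():
        for group in groups:
            if len(proteins ^ group[1]) <= r:
                group[0].append(label)
                group[2] = group[2] | proteins
                break
        else:
            groups.append([[label], proteins, proteins])
    return [(labels, merged) for labels, _representative, merged in groups]


# Hypothetical nodes: the first two differ in exactly 2 proteins (P5 and P6).
toy_nodes = {
    "kinase": frozenset({"P1", "P2", "P3", "P4", "P5"}),
    "ATP-binding": frozenset({"P1", "P2", "P3", "P4", "P6"}),
    "membrane": frozenset({"P7", "P8"}),
}
print(len(merge_at_resolution(toy_nodes, 0)))
print(merge_at_resolution(toy_nodes, 2))
print(effective_resolution(0.01, 500))
```

At R=0 all three nodes stay separate, while at R=2 the "kinase" and "ATP-binding" nodes are treated as one.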
6. Zooming
Clicking on any node apart from the basic set will open a new window in which the protein group you clicked on becomes the basic set. This in effect "zooms in" on the node you clicked on, because it allows viewing a subset at better resolution while removing nodes and edges that are irrelevant. Remember that after you "zoom in" on a subset, you can view it by other keyword types as well. For example, after selecting an interesting subset of proteins which share a functional word (like "hydrolase"), you can choose to see the new subset through a new keyword type, for example taxonomical keywords, and so on. This is an important concept that will allow you to access a wide range of biological information.
7. User properties
In some cases, you might want to inspect biological properties that are not included in the PANDORA sources. Furthermore, in many cases these properties may be naturally quantitative, not binary. For example, if you are looking at proteins that were shown to be up-regulated in an experiment, you might want to consider how much each protein has been up-regulated. For such purposes, PANDORA allows the introduction of "user properties".
If the user property is binary, it is just treated as an extra annotation. However, if it is quantitative, it is dealt with differently. After constructing the graph based on the binary annotations, each node will have a distribution of the quantitative property on it (recall that each node represents a set of proteins). The distribution of the property on each node is visualized using a color histogram. This allows you to easily detect nodes whose proteins have an interesting distribution. For example, if our quantitative property is "change in expression", you could easily detect nodes that are distinctly up-regulated or down-regulated. Such nodes are subsets of proteins that share biological properties AND are similarly regulated.
To see an example of how to use this feature look at the user set input example.
8. Searching the graph
9. Biological examples
PANDORA can be used in several ways to study protein sets. To see some interesting examples, check out the reference paper. Some further biological examples will soon be added.
10. Technical remarks
- The PANDORA website is written in PHP and HTML.
- The following UniProtKB/Swiss-Prot annotations are ignored: '3D Structure', 'Repeat' and 'Complete Proteome'. The following GO annotations are ignored: 'Molecular Function', 'Cellular Localization' and 'Biological Process'.
11. Referencing PANDORA
If you have found PANDORA useful in your research, please reference the following:
Kaplan N, Vaaknin A and Linial M. (2003).
PANDORA: keyword-based analysis of protein sets by integration of annotation sources.
Nucleic Acids Research 31 5617-5626