Protein ranking: from local to global structure in the protein similarity network -- Supplementary data
Jason Weston, Andre Elisseeff, Dengyong Zhou, Christina Leslie and William Stafford Noble
Abstract
Biologists regularly search databases of DNA or protein sequences for evolutionary or functional relationships to a given query sequence. We describe a ranking algorithm that exploits the entire network structure of similarity relationships among proteins in a sequence database by performing a diffusion operation on a pre-computed, weighted network. The resulting ranking algorithm, evaluated using a human-curated database of protein structures, is efficient and provides significantly better rankings than a local network search algorithm such as PSI-BLAST.
- Supplementary results (in PDF format).
- Animation (GIF) of Figure 12 from the supplement.
- ROC50 scores for all queries and all detection methods from the paper in plain text format.
- Training sequences. These are given as indexes into the SCOP set (first column), in the same order as the FASTA file below. The third column is the superfamily each sequence belongs to, given as a class ID; the fourth is the fold (as defined in SCOP).
- Testing sequences (queries), in the same format as above.
- Training and testing sequences, in the same format as above. Pass this file as the second argument to eval.cpp.
- SCOP sequence file in FASTA format containing all sequences in SCOP version 1.59 with less than 95% identity.
- Swiss-Prot sequence file in FASTA format containing all sequences in Swiss-Prot version 40 (zipped, 21 MB).
- 7329x7329 kernel matrices for the methods used in the experiments (the IDs by row or column are available here):
  - BLAST matrix, ASCII text file, gzipped (49 MB).
  - PSI-BLAST (SCOP) matrix using the complete 7329 examples as a database, ASCII text file, gzipped (52 MB).
  - PSI-BLAST (SCOP+SPROT) matrix using all SCOP+SPROT examples as a database, ASCII text file, gzipped (9 MB).
- 108,931x108,931 PSI-BLAST score kernel matrix for RankProp used in the experiments (342 MB): the first 7329 IDs are from SCOP; IDs from 7330 onwards are SPROT proteins. Format: <index> <number of homologs> <indices of homologs> <e-values of homologs>.
- 7329x108,931 PSI-BLAST score kernel matrix for RankProp used for the queries in the experiments (189 MB): unlike the file above, all edges are given (not just the first 1000).
- C++ code to run the experiments:
  - RankProp code (a more general command-line-driven version is also available here).
  - Evaluation code for a ranking provided by a given distance matrix; returns the ROC-50 score of each query.
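
The sparse RankProp matrix files listed above use the record format <index> <number of homologs> <indices of homologs> <e-values of homologs>. The following is a minimal reading sketch, not the code distributed on this page. It assumes one whitespace-separated record per line, with all homolog indices listed before all e-values. The names HomologRecord and parse_record are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One row of the sparse matrix (hypothetical struct, not from the RankProp code).
struct HomologRecord {
    long index = 0;                 // ID of the protein this row describes
    std::vector<long> homologs;     // IDs of its listed homologs
    std::vector<double> evalues;    // e-value of each homolog, in the same order
};

// Parses "<index> <n> <n indices> <n e-values>" from a single line.
// Assumption: fields are whitespace-separated and each record is on one line.
// Returns false if the line is truncated or malformed.
bool parse_record(const std::string& line, HomologRecord& rec) {
    std::istringstream in(line);
    long count = 0;
    rec = HomologRecord{};
    if (!(in >> rec.index >> count) || count < 0) return false;

    for (long i = 0; i < count; ++i) {
        long id = 0;
        if (!(in >> id)) return false;
        rec.homologs.push_back(id);
    }
    for (long i = 0; i < count; ++i) {
        double e = 0.0;
        if (!(in >> e)) return false;
        rec.evalues.push_back(e);
    }
    // Invariant: every homolog has exactly one e-value.
    assert(rec.homologs.size() == rec.evalues.size());
    return true;
}
```

Calling parse_record on each line read with std::getline yields one row at a time, which avoids holding the full 108,931-row file in memory. If the real files split a record across several lines, the reader would need to consume tokens from the stream directly instead.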
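
For readers who want to check the ROC-50 values without running eval.cpp, the score for a single query can be sketched as follows. This is an illustration of the standard ROC-n definition (area under the ROC curve up to the first n false positives, normalized to [0, 1]), not the code from eval.cpp. The function name roc_n, the tie-breaking by index order, and the padding rule for lists with fewer than n negatives are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// ROC-n score for one query (hypothetical helper, not taken from eval.cpp).
// dist[i]   : distance from the query to database item i (smaller = closer)
// is_pos[i] : true if item i is a true homolog of the query
// The caller is assumed to have removed the query itself from both vectors.
double roc_n(const std::vector<double>& dist,
             const std::vector<bool>& is_pos,
             std::size_t max_fp = 50) {
    assert(dist.size() == is_pos.size());
    assert(max_fp > 0);

    // Rank items by increasing distance; ties keep index order (assumption).
    std::vector<std::size_t> order(dist.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&dist](std::size_t a, std::size_t b) { return dist[a] < dist[b]; });

    const std::size_t total_pos = static_cast<std::size_t>(
        std::count(is_pos.begin(), is_pos.end(), true));
    if (total_pos == 0) return 0.0;  // score is undefined; 0 is returned by convention here

    // For each of the first max_fp false positives, add the number of
    // true positives ranked above it.
    std::size_t tp = 0, fp = 0;
    double area = 0.0;
    for (std::size_t idx : order) {
        if (is_pos[idx]) {
            ++tp;
        } else {
            area += static_cast<double>(tp);
            if (++fp == max_fp) break;
        }
    }
    // Assumption: if fewer than max_fp negatives exist, the missing ones are
    // treated as ranked below every positive, so a perfect ranking scores 1.
    area += static_cast<double>(max_fp - fp) * static_cast<double>(tp);

    return area / (static_cast<double>(max_fp) * static_cast<double>(total_pos));
}
```

Applying roc_n to each row of a query-by-database distance matrix, with labels taken from the superfamily column of the sequence files, gives one score per query. Whether this matches the distributed scores exactly depends on how eval.cpp handles ties and short lists.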