PREDATOR: Protein secondary structure prediction from a single sequence or a set of sequences Version 2.1, February 1997 Dmitrij Frishman (1) & Patrick Argos (2) (1) Munich Information Center for Protein Sequences Max-Planck-Institute for Biochemistry Am Klopferspitz 18a, 82152 Martinsried Germany FRISHMAN@MIPS.BIOCHEM.MPG.DE (2) European Molecular Biology Laboratory Postfach 102209, Meyerhofstrasse 1 69012 Heidelberg Germany ARGOS@EMBL-HEIDELBERG.DE CONTENTS 1. About the method.............................................. 2. Copyright notice.............................................. 3. Availability.................................................. 4. Installation.................................................. 5. Input......................................................... 6. Using PREDATOR................................................ 7. Output........................................................ 8. Practical considerations...................................... 9. Bug reports and user feedback................................. 10. References.................................................... 1. About the method PREDATOR [1,2] is a secondary structure prediction program. It takes as input a single protein sequence to be predicted and can optimally use a set of unaligned sequences as additional information to predict the query sequence. The mean prediction accuracy of PREDATOR is 68% for a single sequence and 75% for a set of related sequences. PREDATOR does not use multiple sequence alignment. Instead, it relies on careful pairwise local alignments of the sequences in the set with the query sequence to be predicted. If you supply a set of sequences in the form of a multiple alignment in CLUSTAL or MSF format, the sequences will be used but as unaligned. Below follow the abstracts of the papers describing the method. Prediction from a single sequence [1]: "Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on statistics of local residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved accuracy limit of such single residue and single sequence prediction methods is about 65% in three structural states (a-helix, b-strand, and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure prediction based on recognition of potentially hydrogen-bonded residues in the amino acid sequence. The unique feature of our approach involves data-base derived statistics on residue type occurrences in different classes of b-bridges to delineate interacting b-strands. The a-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded pairs (i,i+4). The algorithm has a prediction accuracy 68% in three structural states on only a single protein sequence, and has the potential to be improved by 5-7% if homologous aligned sequences are also considered". Prediction from a set of sequences [2]: "In this work we present an accurate secondary structure prediction procedure for a query sequence with related sequences. The most novel aspect of our approach is its reliance on pairwise alignments of the related sequences rather than utilization of a multiple alignment. The residue-by-residue accuracy of the method is 75% in three structural states after jack-knife tests. The gain in prediction accuracy compared to the existing techniques (which are near 72%) is achieved by better secondary structure propensities for individual sequences which account for long-range effects, utilization of homologous information in the from of carefully selected pairwise alignment fragments, and reliance on a much larger collection of protein primary structures. The method is especially appropriate for large-scale sequence analysis efforts, such as genome characterization, where precise and significant multiple alignments are not available or achievable". 2. Copyright notice All rights are reserved for the whole or part of the program. Permission to use, copy, and modify this software and its documentation is granted for academic use provided that: i. this copyright notice appears in all copies of the software and related documentation; ii. references given below [1,2] must be cited in any publication of scientific results based in part or completely on the use of the program; and iii. bugs will be reported to the authors. The use of the software in commercial activities is not allowed without prior permission from the authors. WARNING: PREDATOR is provided "as-is" and without warranty of any kind, express, implied or otherwise, including without limitation any warranty of merchantability or fitness for a particular purpose. In no event will the authors be liable for any special, incidental, indirect or consequential damages of any kind, or any damages whatsoever resulting from loss of data or profits, whether or not advised of the possibility of damage, and on any theory of liability, arising out of or in connection with the use or performance of this software. 3. Availability PREDATOR exists for UNIX and DOS. Documentation, data files and source code are available by anonymous FTP from ftp.ebi.ac.uk (directories /pub/software/unix/predator, /pub/software/dos/predator). A DOS executable is also available. Protein sequences can be submitted for secondary structure prediction either to WWW URL http://www.embl- heidelberg.de/predator/predator_info.html or through electronic mail to predator@embl-heidelberg.de. A mail message containing HELP in the first line will be answered with appropriate instructions. 4. Installation The program is supplied with two database files - stride.dat and dssp.dat - which contain propensity tables, secondary structural assignments and thresholds for two assignment methods - STRIDE [3] and DSSP [4]. One of these files is read by the program every time it is started. The environment variable PRE_DIR has to contain the name of the directory in which the files are located. For example, if you work under UNIX in csh, you will have to include the following statement in your .cshrc file: setenv PRE_DIR /your/directory/with/these/files In DOS, your autoexec.bat file has to contain the following statement: set PRE_DIR=disk:\your\directory\with\these\files If the variable PRE_DIR is not set, PREDATOR will look for the files stride.dat and dssp.dat in the current directory. You can also explicitly specify the location of the database file using the option -b (see below). 5. Input PREDATOR takes as input a sequence file in FASTA, MSF or CLUSTAL format containing one or many protein sequences. By default, the prediction will be made for the first sequence in the set. This can be changed using the options -i or -a (see below). Fasta format looks like this: > One line description of the first sequence (the first word will be used as sequence id) HGFSADSAREYPLKSASDSDA ERYTARWFDSGHKLNXMMS > One line description of the second sequence RYTSGFHAWQWDMNKLBNSSS etc The ability to read MSF and CLUSTAL formats is added for convenience. The actual multiple alignment will not be used. 6. Using PREDATOR The only required parameter for PREDATOR is the name of the file containing a protein sequence or a set of sequences in FASTA, MSF or CLUSTAL format. By default PREDATOR writes to standard output; i.e., your screen. On systems that allow to redirect output, you can do so to create a disk file. Help is available if you just type PREDATOR without parameters. The following options are accepted. General options -fFilename write output to the file "Filename" rather than to stdout. -l long output form, in which every output line contains residue number, three-letter residue name, one-letter residue name, predicted secondary structural state and reliability estimate. If a STRIDE or DSSP secondary structure assignment has been read (options -x and -y, see below), the known assignment will also be shown in the output for comparison. By default the short output form is used. -o output sequence(s) and die -h indicate progress by dots and output other additional information if available. Selection of sequences to predict -a make prediction for All sequences in the input file -iSeqId make prediction for the sequence SeqId By default prediction is made for the first sequence in the set Prediction options -s perform single sequence prediction. Ignore other sequences in the set. -r preserve the original alignment in the CLUSTAL or MSF file (do not unalign) /*Not implemented yet!*/ -u Do not copy assignment directly from the PDB database if query sequence is found in PDB. By default, the known conformation of 7-residue segments will be used if they are identical to a 7-residue fragment in the query sequence. -d use DSSP target assignment (default is STRIDE). The predictions made with DSSP and STRIDE target assignments are optimized to reproduce these assignments as well as possible. -bFilename use database file FileName Comparison with the known assignment (for testing): -xFileName read STRIDE file -yFileName read DSSP file -zChain PDB Chain (must be specified if option -x or -y are used) Additional functions: -nPercentId find a subset of sequences with no more than PercentId identity between any pair of sequences (quick and dirty algorithm) All options are case- and position-insensitive. Sequence names (option -i) are case sensitive. Examples: 1. Predict secondary structure for the single sequence globin.seq predator globin.seq 2. Predict secondary structure for the first sequence in the multiple sequence file globins.seq predator globins.seq 3. Predict secondary structure of the first sequence in the multiple alignment file globins.aln. Create long output and write it in the file globins.pred predator -l globins.aln -fglobins.pred 4. Predict secondary structure of the sequence glob_ecoli in the multiple alignment file globins.aln predator globins.aln -iGLOB_ECOLI 5. Read sequence from the file 5ruba.seq, make prediction and compare it with the known assignment for the chain A from the file 5rub.str. predator 5ruba.seq -x5rub.str -za NOTE: chain " " must be specified as "-"; e.g. -z- 7. Output Short output form: Secondary structure states of amino acids are indicated by the letters "H" (helix), "E" (extended or sheet), and "_" (coil). Long output form: Secondary structure states of amino acids are indicated by letters "H" or "h" (helix), "E" or "e" (extended), and "C" or "c" (coil). The prediction is shown in lower case except for those residues for which the assignment was directly copied from the PDB database. This feature is added so that you can distinguish between the predictions actually made by PREDATOR and those taken from known structures. The prediction is contained in the records beginning with the identifier PRED in the first columns. For each amino acid site of your sequence, residue number, three- and one-letter residue code, prediction, reliability estimate, and the number of residues from related sequences projected onto this residue through the local alignment procedure are shown in subsequent columns. Additionally, if the STRIDE or DSSP assignments have been read using the options -x or -y (and -z), the last column of the PREDATOR output will contain the actual secondary structural assignment for your sequence if it corresponds exactly to the one in the STRIDE or DSSP file (for comparison). If the known assignment is not available, i.e., if you did not use the -x or -y options, question signs will be output. Both output forms: If option -h has been used, PREDATOR will show progress by printing dots on the standard output. If your sequence has related sequences with known 3D structure, PDB identifiers of these sequences will be printed. 8. Practical considerations 1. For long sequences and large sequence sets PREDATOR can be slow. It is recommended to use the option -h to monitor its progress. Note that some of the sequence identifiers will be skipped if there are no significant local alignments between them and the query sequence. On the other hand, some sequences can appear more than once if there are several local alignments. If you need a prediction for only one sequence in the set, do not use the -a option as it may significantly slow down the computation. In this case PREDATOR will calculate full secondary structure propensities for all sequence in the set rather than just for selected pieces significantly aligned with the query sequence. On the other hand, if you really need a prediction for all sequences, it is more time-efficient to use the -a option rather than running PREDATOR on each sequence in the set at a time since in this case all propensities have to be calculated anyway. 2. If some sequences in the set have no fragments significantly related to the query sequence, they will not be used for the prediction. Thus, it is NOT a problem to have unrelated sequences in your sequence set. 3. The quality of the prediction depends dramatically on the number of sequences in the set. The more sequence information you provide, the better results you will get. Therefore, it is strongly recommended to perform a sensitive sequence search with your sequence against the largest sequence database available to extract as many related sequences as you can. For example, the FASTA program with ktuple=1 gives good results. The largest collections of protein sequences (about 160000 entries) are currently TREMBL and GENPEPT. The following steps are recommended. a) Run a database search with your single sequence against a large sequence database. b) Make the set of sequences extracted as a result of the database search non- redundant such that no two sequences share more than 95% sequence identity (e.g., predator -n95 filename). c) Merge your search sequence with the resulting non-redundant set. d) Perform secondary structure prediction. You can submit your sequence for prediction to the WWW or e-mail server (see availability) where all these steps will be done automatically for you. 4. If the -l option is specified, PREDATOR will output the reliability index for each residue predicted. In general, reliability values higher then 0.8 indicate sequence sites predicted at about 90% accuracy. If the prediction reliability for a given residue equals zero, it does not mean that the prediction is completely unreliable. It means that there was insufficient statistics to derive the reliability for the set of secondary structure propensities associated with this residue; i.e., the reliability is unknown. Less than 10% of residues generally do not have a reliability estimate. Sequence fragments with atypical composition (e.g., "asfasfasfasfasaf" ) will have no reliability estimate as well as short proteins (less then 50 amino acids) where statistics are scarce. 5. If your sequence has a closely related protein with known 3D structure, the secondary structure prediction of every 7-residue fragment in your sequence identical with the PDB sequence will be substituted by the known tertiary structural assignments. PREDATOR currently relies on a database of 556 PDB chains with pairwise identity no higher than 30%. Please note that no additional filtering of the prediction is made after this such that it can contain, for instance, helices of length one. Related PDB structures will be reported if the option -h was specified. To avoid this substitution (for example for testing), use the option -u. 9. Bug reports and user feedback Please send your suggestions, questions and bug reports to FRISHMAN@MIPS.BIOCHEM.MPG.DE. Send your contact address to get information on updates and new features. 10. References 1. Frishman, D. & Argos, P. (1996) Incorporation of long-distance interactions into a secondary structure prediction algorithm. Protein Engineering, 9, 133-142. 2. Frishman, D. & Argos, P. (1997) 75% accuracy in protein secondary structure prediction. Proteins, 27, 329-335. 3. Frishman,D & Argos,P. (1995) Knowledge-based secondary structure assignment. Proteins: structure, function and genetics, 23, 566-579. 4. Kabsch,W. & Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22: 2577-2637. - i -