The Protein Information Resource (PIR)

Winona C. Barker; John S. Garavelli; Hongzhan Huang; Peter B. McGarvey; Bruce C. Orcutt; Geetha Y. Srinivasarao; Chunlin Xiao; Lai-Su L. Yeh; Robert S. Ledley; Joseph F. Janda; Friedhelm Pfeiffer; Hans-Werner Mewes; Akira Tsugita; Cathy Wu

doi:10.1093/nar/28.1.41

2000-01-01 00:00:00

© 2000 Oxford University Press. Nucleic Acids Research, 2000, Vol. 28, No. 1, 41–44

Winona C. Barker*, John S. Garavelli, Hongzhan Huang, Peter B. McGarvey, Bruce C. Orcutt, Geetha Y. Srinivasarao, Chunlin Xiao, Lai-Su L. Yeh, Robert S. Ledley, Joseph F. Janda, Friedhelm Pfeiffer(1), Hans-Werner Mewes(1), Akira Tsugita(2) and Cathy Wu

Protein Information Resource, National Biomedical Research Foundation, 3900 Reservoir Road, NW, Washington, DC 20007, USA, (1) GSF-Forschungszentrum für Umwelt und Gesundheit, Munich Information Center for Protein Sequences am Max-Planck-Institut für Biochemie, Am Klopferspitz 18, D-82152 Martinsried, Germany and (2) Japan International Protein Information Database, Amakubo 1-16-1, Tsukuba 305-0005, Japan

Received October 1, 1999; Accepted October 4, 1999

*To whom correspondence should be addressed. Tel. +1 202 687 2121; Fax: +1 202 687 1662; Email: [email protected]

ABSTRACT

The Protein Information Resource (PIR) produces the largest, most comprehensive, annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Sequence Database (JIPID). The expanded PIR WWW site allows sequence similarity and text searching of the Protein Sequence Database and auxiliary databases. Several new web-based search engines combine searches of sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. New capabilities for searching the PIR sequence databases include annotation-sorted search, domain search, combined global and domain search, and interactive text searches. The PIR-International databases and search tools are accessible on the PIR WWW site at http://pir.georgetown.edu and at the MIPS WWW site at http://www.mips.biochem.mpg.de . The PIR-International Protein Sequence Database and other files are also available by FTP.

INTRODUCTION

The accelerating pace of genome sequencing projects has greatly increased the volume and complexity of available molecular data. To realize the fullest possible value from the data and to gain a better understanding of the genome, databases and the computational tools for analyzing them are required to allow biologically relevant features in the sequences to be identified and to provide insight on their structure and function. For over 30 years, the Protein Information Resource (PIR) has been providing the scientific community with databases and tools for the organization and analysis of protein sequence data (1,2). Together with MIPS and JIPID, we have undertaken a major restructuring to meet the challenges presented by the rapid growth of largely uncharacterized sequence data and the opportunities provided by the nearly universal access of scientists to the resources available on the WWW. Among the key developments are complete protein family organization for the PIR-International Protein Sequence Database (PSD) and integrated WWW interfaces for user-friendly sequence analysis, database searching and information retrieval.

THE PIR-INTERNATIONAL PROTEIN DATABASES

PIR, MIPS and JIPID constitute the PIR-International consortium that maintains the PIR-International Protein Sequence Database (PSD), the largest publicly distributed and freely available protein sequence database. The database has the following distinguishing features.

• It is a comprehensive, annotated, and non-redundant protein sequence database, containing over 142 000 sequences as of September 1999. Included are sequences from the completely sequenced genomes of 16 prokaryotes, six archaebacteria, 17 viruses and phages, >100 eukaryote organelles and Saccharomyces cerevisiae.
• The collection is well organized with >99% of entries classified by protein family and >57% classified by protein superfamily.
• PSD annotation includes concurrent cross-references to other sequence, structure, genomic and citation databases, including the public nucleic acid sequence databases ENTREZ, MEDLINE, PDB, GDB, OMIM, FlyBase, MIPS/Yeast, SGD/Yeast, MIPS/Arabidopsis and TIGR. Where these databases are publicly and freely accessible and provide suitable WWW access, the cross-references presented on the PIR WWW site are hot-linked so that searchers can consult the most current data.
• The PIR is the only sequence database to provide context cross-references between its own database entries. These cross-references assist searchers in exploring relationships such as subunit associations in molecular complexes, enzyme–substrate interactions, activation and regulation cascades, as well as in browsing entries with shared features and annotations.
• Interim updates are made publicly available on a weekly basis, and full releases have been published quarterly since 1984.

In addition to the PSD, PIR-International distributes or provides WWW access to other sequence and auxiliary databases (Table 1), briefly described below, and maintains several internal data collections used for sequence annotation and integrity checks.

Table 1. PIR-International sequence and auxiliary databases

Database | Description | Information
PSD | Annotated and classified protein sequences | http://pir.georgetown.edu/pirwww/dbinfo/textpsd.html
PATCHX | Sequences not yet in the PIR-International PSD | http://pir.georgetown.edu/pirwww/dbinfo/patchx.html
ARCHIVE | Sequences as originally reported in a publication or submission | http://pir.georgetown.edu/pirwww/dbinfo/archive.html
NRL_3D | Sequences from three-dimensional structure database PDB | http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
FAMBASE | Representative sequences from each protein family | http://pir.georgetown.edu/pirwww/dbinfo/fambase.html
PIR-ALN | Sequence alignments of superfamilies, families and homology domains | http://pir.georgetown.edu/pirwww/dbinfo/piraln.html
RESID | Post-translational modifications with PSD feature information | http://pir.georgetown.edu/pirwww/dbinfo/resid.html
ProClass | Non-redundant sequences organized according to superfamilies and motifs | http://pir.georgetown.edu/gfserver/proclass.html
ProtFam | Sequence alignments of superfamilies | http://www.mips.biochem.mpg.de/proj/protfam/protfam

• PATCHX (3) is a non-redundant database assembled by MIPS of publicly available protein sequences not yet in the PIR-International PSD. PIR+PATCHX, a combination of the PSD and PATCHX containing ~300 000 sequences available for similarity searches, is the most complete non-redundant collection of protein sequences available in the public domain.
• ARCHIVE is a database of protein sequences as originally reported in a publication or submission, the only such collection of ‘as published’ unmerged sequences.
• NRL_3D (4) sequence-structure database is produced from sequence and annotation in the Protein Data Bank (PDB) of three-dimensional structures (5).
• FAMBASE is a collection of representative sequences from each protein family that can be used in a similarity search to reduce search time and improve sensitivity for identifying distant families.
• PIR-ALN (6) is a curated database of sequence alignments of superfamilies, families and homology domains, with annotation information derived from PSD and consensus patterns calculated from the alignments.
• RESID (7) is a database of post-translational modifications with descriptive, chemical, structural and bibliographic information based on feature information in the PSD.
• ProClass (8) is a protein family database that organizes non-redundant PIR-International PSD and SWISS-PROT sequences according to PIR superfamilies and PROSITE patterns.
• ProtFam (9) is a curated database of homology clusters with automatically generated multiple sequence alignments for families, superfamilies and homology domains.

To support both data management and data mining and assist knowledge discovery, the PIR databases are being migrated to an object-relational database management system. A three-tier network computing architecture provides a framework for distributed object computing, and Java-based WWW interfaces connect with the database server for database query and update tasks.

Superfamily and family classification

The pioneering work of Margaret Dayhoff on protein classification based on the superfamily concept (10,11) was refined by PIR-International (12) to assist database organization and molecular evolution studies. Central to the organization and annotation of the PIR-International sequence and auxiliary databases are the protein family relationships, which are structured at three levels: (i) superfamilies and families (for full-length sequence similarity), (ii) homology domains (for local functional or structural units), and (iii) motifs (for functional or structural sites). PIR-International has maintained the highest classification rate and provided the most comprehensive classification and alignments of proteins among all major public domain databases.

To deal efficiently with the many new sequences from genome sequencing projects, procedures for family and superfamily classification have been automated. Over 99% of sequences are routinely classified shortly after entry into the database into protein families of sequences that are at least 45% identical. Subsequently, entries are further clustered into regular superfamilies of sequences that share end-to-end homology (but may be rather distantly related) and also into domain superfamilies of proteins sharing at least one common homology domain. There are currently >76 000 sequences in >8900 superfamilies, and 30 000 entries with 370 recognized homology domains in the PSD. Corresponding to the classification are 1500 superfamily, 2100 family and 400 domain alignments in the PIR-ALN database, and 15 000 family and 4500 superfamily alignments in the MIPS ProtFam database. Also available from PIR is the ProClass protein family database containing 92 000 classified entries as well as 1300 motif alignments of ~44 000 PIR-International PSD entries.

THE PIR SEARCH AND ANALYSIS SYSTEM

The PIR search and analysis system provides search engines of three types (Table 2): (i) interactive text-based search engines, which allow Boolean queries of text fields; (ii) standard sequence similarity search engines, including Peptide Match, Pattern Match, BLAST, FASTA, Pairwise Alignment and Multiple Alignment; and (iii) advanced search engines that combine sequence similarity and annotation searches or evaluate gene family relationships, including Annotation-Sorted Similarity Search, Domain Search, Global and Domain Similarity Search and GeneFIND. Sequence searching can be performed against the PSD, NRL_3D, PATCHX, FAMBASE, ProClass and the combined PSD+PATCHX collections. Text and entry searching is provided for the PSD, NRL_3D, PIR-ALN and RESID databases.

Table 2. PIR Search and Analysis systems

Search engine | Description | Location
Text/Entry | Interactive searching of text fields, refine with multiple queries | http://pir.georgetown.edu/pirwww/search/textpsd.html
BLAST | Sequence similarity search and analysis, graphical output interface | http://pir.georgetown.edu/pirwww/search/similarity.html
FASTA | Sequence similarity search and analysis, graphical output interface | http://pir.georgetown.edu/pirwww/search/fasta.html
Pattern/Peptide | Variety of pattern or peptide matching tools | http://pir.georgetown.edu/pirwww/search/patmatch.html
Pairwise alignments | Alignments of PIR or user-supplied sequences using SSEARCH | http://pir.georgetown.edu/pirwww/search/pairwise.html
Multiple alignments | Alignments of PIR or user-supplied sequences using CLUSTALW | http://pir.georgetown.edu/pirwww/search/multaln.html
PIR Annotation-Sorted search | Displays of BLAST or FASTA matches sorted by user-selected annotation | http://pir.georgetown.edu/pirwww/search/pass.html
Domain search | Domain sequence search using FASTA, graphical output interface | http://pir.georgetown.edu/pirwww/search/domains.html
Global and Domain search | BLAST and FASTA searching of PSD for global and domain similarity with graphical display | http://pir.georgetown.edu/pirwww/search/dmsim.html
Integrated Environment for Sequence Analysis | Convenient interface for entry retrieval and sequence and annotation searching | http://pir.georgetown.edu/pirwww/search/piriesa.html
GeneFIND | Protein family classification by combining several search and alignment tools and the ProClass database | http://pir.georgetown.edu/gfserver/genefind.html

Sequence search and alignment

BLAST (13) and FASTA (14) searches for sequence similarity are available for all sequence databases. The output of these search engines employs a graphical interface showing location of hits within the query sequence and full-length alignments generated by SSEARCH (15). Multiple or pairwise alignments of PSD or user-supplied sequences can be done using CLUSTALW (16) or SSEARCH. PIR pattern or peptide matching programs can (i) match a query sequence against a database of regular expressions (i.e., patterns); (ii) search a user-specified regular expression against a sequence database; or (iii) find an exact match for a user-specified peptide sequence in one of the sequence databases, including the ARCHIVE database of ‘as published’ sequences.

PIR Similarity Search system

Combining sequence and annotation search, the Annotation-Sorted Similarity Search facility displays BLAST or FASTA matches along with the user-selected annotation (superfamily, family, species, taxonomic group, keyword or all five) in the annotation-sorted order. The matched entries can be selected for multiple alignments against the query sequence using CLUSTALW and displayed using MView (17).

The Domain Similarity Search engine uses FASTA to search against domain sequences compiled from the PIR-International PSD, and displays the PSD entry and domain annotation with a graphical representation of the matched region with links to domain alignments in PIR-ALN. The Global and Domain Similarity Search uses BLAST to search the PSD for global similarity and FASTA to search the domain sequence collection for local similarity. The results are ranked by the global score and show the extent of matches at both the global and domain levels. Any combination of complete sequences and domains can be selected and viewed in a multiple alignment.

The PIR Integrated Environment for Sequence Analysis provides an integrated environment for all of the above protein analysis tasks, including sequence similarity search, pattern and peptide match, multiple sequence alignment and advanced PIR similarity searches, as well as for entry retrieval by unique superfamily, family, title, species, taxonomic group, domains or keywords.

GeneFIND (18) provides protein family classification and information retrieval by combining several search/alignment tools and the ProClass database in a multi-level filter system, including the MOTIFIND neural networks, BLAST search, SSEARCH sequence alignment, motif pattern matching, hidden Markov motif modeling and CLUSTALW multiple motif alignment.

AVAILABILITY

PIR provides free public access to value-added protein information through its WWW site at http://pir.georgetown.edu and direct file transfer at ftp://nbrfa.georgetown.edu/pir . In addition to the databases (Table 1) and search tools (Table 2), the PIR WWW site also provides associated metadata, including technical bulletins and documentation that serve as metadata dictionaries for the PIR-International PSD. Accessible from the PIR anonymous FTP site are PIR-International databases and many other documents, files and software tools, including the weekly interim updates of the PSD (in NBRF format) and the corresponding sequence file (in FASTA format). The PIR-International PSD quarterly releases (in both NBRF and CODATA formats) are also available at the NCBI FTP server. Other sites and data depositories do not always have the most recent quarterly release of the PSD.

ACKNOWLEDGEMENTS

PIR is a registered mark of NBRF. The work at NBRF is supported by grant number P41 LM05978 from the National Library of Medicine and by gifts from COMPAQ, Pfizer and Dupont. The work at MIPS is supported by the Federal Ministry of Education, Science, Research and Technology (BMBF, FKZ 03311670, 01KW9703/7), the Max-Planck-Society and the European Commission (BIO4-CT96-0110, 0338, 0558).

REFERENCES

1. Dayhoff,M.O., Eck,R.V., Chang,M.A. and Sochard,M.R. (1965) Atlas of Protein Sequence and Structure, Vol. 1. National Biomedical Research Foundation, Silver Spring, MD.
2. Dayhoff,M.O. (1979) Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3. National Biomedical Research Foundation, Washington, DC.
3. Barker,W.C., George,D.G., Mewes,H.-W., Pfeiffer,F. and Tsugita,A. (1993) Nucleic Acids Res., 21, 3089–3092.
4. Pattabiraman,N., Namboodiri,K., Lowrey,A. and Gaber,B.P. (1990) Protein Seq. Data Anal., 3, 387–405.
5. Abola,E.E., Manning,N.O., Prilusky,J., Stampf,D.R. and Sussman,J.L. (1996) J. Res. Natl Inst. Stand. Technol., 101, 231–241.
6. Srinivasarao,G.Y., Yeh,L.-S., Marzec,C.R., Orcutt,B.C. and Barker,W.C. (1999) Bioinformatics, 15, 382–390.
7. Garavelli,J.S. (2000) Nucleic Acids Res., 28, 209–211 (this issue).
8. Wu,C., Xiao,C. and Huang,H. (2000) Nucleic Acids Res., 28, 273–276 (this issue).
9. Mewes,H.W., Frishman,D., Haase,D., Kaps,A., Lemcke,K., Mannhaupt,G., Pfeiffer,F., Schüller,C., Stocker,S. and Weil,B. (2000) Nucleic Acids Res., 28, 37–40 (this issue).
10. Dayhoff,M.O. (1976) Fed. Proc., 35, 2132–2138.
11. Dayhoff,M.O., McLaughlin,P.J., Barker,W.C. and Hunt,L.T. (1975) Naturwissenschaften, 62, 154–161.
12. Barker,W.C., Pfeiffer,F. and George,D. (1996) Methods Enzymol., 266, 59–71.
13. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
14. Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444–2448.
15. Smith,T.F. and Waterman,M.S. (1981) Adv. Appl. Math., 2, 482–489.
16. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
17. Brown,N.P., Leroy,C. and Sander,C. (1998) Bioinformatics, 14, 380–381.
18. Wu,C.H., Huang,H. and Shivakumar,S. (1999) Int. J. Artificial Intelligence Tools, in press.
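
Illustrative sketch: pattern and peptide matching

The pattern and peptide matching described under ‘Sequence search and alignment’ (searching a user-specified regular expression against a sequence database, and finding exact matches for a user-specified peptide) can be illustrated with a minimal Python sketch. This is not the PIR implementation. The function names (parse_fasta, pattern_search, peptide_search) and the two toy sequences are hypothetical and made up for illustration, and the input is assumed to be a FASTA-format sequence file such as the one distributed alongside the weekly PSD updates.

```python
import re
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First whitespace-delimited token after '>' is taken as the identifier.
            tokens = line[1:].split()
            current = tokens[0] if tokens else "unnamed"
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def pattern_search(records: Dict[str, str], pattern: str) -> List[Tuple[str, int, str]]:
    """Return (identifier, 1-based start, matched text) for each non-overlapping regex hit."""
    regex = re.compile(pattern)
    hits = []
    for ident, seq in records.items():
        for match in regex.finditer(seq):
            hits.append((ident, match.start() + 1, match.group()))
    return hits


def peptide_search(records: Dict[str, str], peptide: str) -> List[Tuple[str, int]]:
    """Return (identifier, 1-based start) for each exact occurrence, overlaps included."""
    peptide = peptide.upper()
    hits = []
    for ident, seq in records.items():
        pos = seq.find(peptide)
        while pos != -1:
            hits.append((ident, pos + 1))
            pos = seq.find(peptide, pos + 1)
    return hits


# Made-up toy sequences, not real PSD entries.
EXAMPLE_FASTA = """>demo1 hypothetical protein A
MKTAYIAKQRNGSLEV
GSGKSTLLN
>demo2 hypothetical protein B
MNLSAAGKTNASW
"""

if __name__ == "__main__":
    db = parse_fasta(EXAMPLE_FASTA)
    # N-glycosylation-style motif written as a regular expression.
    for hit in pattern_search(db, r"N[^P][ST][^P]"):
        print(hit)
    for hit in peptide_search(db, "GKS"):
        print(hit)
```

Positions are reported 1-based, as is conventional for residue numbering. A production service would need indexing to search hundreds of thousands of sequences efficiently; the linear scan here is only meant to make the two search modes concrete.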

Publisher: Oxford University Press
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/28.1.41