Curr. Issues Mol. Biol. (2001) 3(3): 47-55.

SWISS-PROT: Connecting Biomolecular Knowledge Via a Protein Database

Elisabeth Gasteiger*, Eva Jung, and Amos Bairoch

SWISS-PROT group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel-Servet, 1211 Genève 4, Switzerland

*For correspondence. Email [email protected]; Tel. +41-22-7025050; Fax. +41-22-7025858.

© 2001 Caister Academic Press

Abstract

With the explosive growth of biological data, the development of new means of data storage was needed. More and more often biological information is no longer published in the conventional way via a publication in a scientific journal, but only deposited into a database. In the last two decades these databases have become essential tools for researchers in biological sciences. Biological databases can be classified according to the type of information they contain. There are basically three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as various specialized data collections. It is important to provide the users of biomolecular databases with a degree of integration between these databases as by nature all of these databases are connected in a scientific sense and each one of them is an important piece to biological complexity. In this review we will highlight our effort in connecting biological information as demonstrated in the SWISS-PROT protein database.

Data Integration Using Cross-References

The current situation of a research scientist has been described quite accurately by a quote from John Naisbitt, saying that “We are drowning in information, but starving for knowledge”. The World Wide Web (Berners-Lee, 1999), which immensely facilitated information exchange between information providers and users, now offers the life science community a wealth of easily accessible knowledge and information. While clicking on hypertext links and thus navigating between databases maintained around the world seems to be a technically easy task, the challenge lies in extracting the complete and up-to-date information related to a research field from the hundreds of databases available. The user can be assisted in this task by the creators of information resources, who should attempt to provide a system that allows scientists to rapidly and efficiently consult all information pertinent to a given topic. This is usually done by establishing cross-references from each record in a database to related entries in other databases.

Cross-References in SWISS-PROT

SWISS-PROT (Bairoch et al., 2000) is a curated protein sequence database, which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications (PTM), variants, etc.), a minimal level of redundancy and high level of integration with other databases.

SWISS-PROT Entry Format, a Sample Entry and Line Types Implementing Cross-References

A SWISS-PROT entry is composed of different line types, and each line is introduced by a two-letter code indicating the type of data following on that line (see Figure 1 for a sample entry). The first section of every SWISS-PROT entry contains the entry name (ID), a unique primary accession number (AC), sometimes followed by several secondary accession numbers, and dates indicating when the entry was created and when its sequence and annotations were last updated (DT). The description line (DE) lists all names, including synonyms, under which the protein has been known, and the GN line contains the name(s) of the gene(s) coding for it. The following section contains taxonomic data about the organism from which the protein originates, in particular the organism name (OS), its classification in the taxonomic tree (OC) and a unique taxonomy identifier (OX). The reference section (RN, RP, RX, RA, RT and RL lines) contains all literature references consulted for the annotation of the protein. The list of references includes not only publications of the sequence itself, but also articles detailing post-translational modifications, 3-D structure, polymorphisms etc. The reference section is followed by the comment block (CC) containing textual information classified into different “topics” and describing the protein’s function, subcellular localisation, post-translational modifications, association with diseases etc.

Database cross-references are stored in the DR lines and allow the user to access related information in other databases. DR lines will be described in detail in the next section. The keyword section (KW line type) lists a number of terms from a controlled vocabulary, which can be used to retrieve subsets of the database. A very important part of a SWISS-PROT protein entry is the feature table (FT lines), which contains information about interesting sites or domains within the protein sequence, for which positional information is known. The feature table describes events such as post-translational modifications, sequence variants due to polymorphisms, domain structure, sequence conflicts, etc. Each feature line consists of a feature key, start and end positions of the described feature in the precursor sequence, and the feature description. Finally, there is the amino acid sequence itself.

The SWISS-PROT database was the first biomolecular

ID   APE_HUMAN      STANDARD;      PRT;   317 AA.
AC   P02649;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   01-OCT-2000 (Rel. 40, Last annotation update)
DE   Apolipoprotein E precursor (Apo-E).
GN   APOE.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A. (VARIANT E3).
RX   MEDLINE=84185684; PubMed=6325438;
RA   Zannis V.I., McPherson J., Goldberger G., Karathanasis S.K.,
RA   Breslow J.L.;
RT   "Synthesis, intracellular processing, and signal peptide of human
RT   apolipoprotein E.";
RL   J. Biol. Chem. 259:5495-5499(1984).
RN   [2]
RP   SEQUENCE FROM N.A. (VARIANT E3).
RX   MEDLINE=84212473; PubMed=6327682;
RA   McLean J.W., Elshourbagy N.A., Chang D.J., Mahley R.W., Taylor J.M.;
RT   "Human apolipoprotein E mRNA. cDNA cloning and nucleotide sequencing
RT   of a new variant.";
RL   J. Biol. Chem. 259:6498-6504(1984).
… 15 references omitted …
RN   [18]
RP   X-RAY CRYSTALLOGRAPHY (2.25 ANGSTROMS) OF 41-184.
RX   MEDLINE=91289138; PubMed=2063194;
RA   Wilson C., Wardell M.R., Weisgraber K.H., Mahley R.W., Agard D.A.;
RT   "Three-dimensional structure of the LDL receptor-binding domain of
RT   human apolipoprotein E.";
RL   Science 252:1817-1822(1991).
RN   [19]
RP   X-RAY CRYSTALLOGRAPHY (2.0 ANGSTROMS) OF 41-181.
RX   MEDLINE=96313129; PubMed=8756331;
RA   Dong L.-M., Parkin S., Trakhanov S.D., Rupp B., Simmons T.,
RA   Arnold K.S., Newhouse Y.M., Innerarity T.L., Weisgraber K.H.;
RT   "Novel mechanism for defective receptor binding of apolipoprotein E2
RT   in type III hyperlipoproteinemia.";
RL   Nat. Struct. Biol. 3:718-722(1996).
RN   [20]
RP   X-RAY CRYSTALLOGRAPHY (1.85 ANGSTROMS) OF 22-165.
RX   MEDLINE=20306971; PubMed=10850798;
RA   Segelke B.W., Forstner M., Knapp M., Trakhanov S.D., Parkin S.,
RA   Newhouse Y.M., Bellamy H.D., Weisgraber K.H., Rupp B.;
RT   "Conformational flexibility in the apolipoprotein E amino-terminal
RT   domain structure determined from three new crystal forms:
RT   implications for lipid binding.";
RL   Protein Sci. 9:886-897(2000).
CC   -!- FUNCTION: APO-E MEDIATES BINDING, INTERNALIZATION, AND CATABOLISM
CC       OF LIPOPROTEIN PARTICLES. IT CAN SERVE AS A LIGAND FOR THE LDL(APO
CC       B/E) RECEPTOR AND FOR THE SPECIFIC APO-E RECEPTOR (CHYLOMICRON
CC       REMNANT) OF HEPATIC TISSUES.
CC   -!- SUBCELLULAR LOCATION: EXTRACELLULAR.
CC   -!- TISSUE SPECIFICITY: OCCURS IN ALL LIPOPROTEIN FRACTIONS IN PLASMA.
CC       IT CONSTITUTES 10-20% OF VERY LOW DENSITY LIPOPROTEINS (VLDL) AND
CC       1-2% OF HIGH DENSITY LIPOPROTEINS (HDL). APOE IS PRODUCED IN MOST
CC       ORGANS. SIGNIFICANT QUANTITIES ARE PRODUCED IN LIVER, BRAIN,
CC       SPLEEN, LUNG, ADRENAL, OVARY, KIDNEY, AND MUSCLE.
CC   -!- PTM: SYNTHESIZED WITH THE SIALIC ACID ATTACHED BY O-GLYCOSIDIC
CC       LINKAGE AND IS SUBSEQUENTLY DESIALATED IN PLASMA.
CC   -!- POLYMORPHISM: THREE MAJOR ISOFORMS CAN BE RECOGNIZED, DESIGNATED
CC       E2, E3, AND E4, ACCORDING TO THEIR RELATIVE POSITION AFTER
CC       ISOELECTRIC FOCUSING. THE MOST COMMON VARIANT IS E3 AND IS PRESENT
CC       IN 40-90% OF THE POPULATION.
CC   -!- DISEASE: IN ADDITION TO THE INFLUENCE OF COMMON APOE VARIANTS ON
CC       LIPOPROTEIN METABOLISM IN HEALTHY INDIVIDUALS, APOE VARIANTS ARE
CC       ALSO ASSOCIATED WITH FAMILIAL DYSBETALIPOPROTEINEMIA (FD):
CC       INDIVIDUALS WITH TYPE III HYPERLIPOPROTEINEMIA, ARE CLINICALLY
CC       CHARACTERIZED BY XANTHOMAS, YELLOWISH LIPID DEPOSITS IN THE PALMAR
CC       CREASE THAT ARE PATHOGNOMONIC FOR FD, OR LESS SPECIFIC ON TENDONS
CC       AND ON ELBOWS. FD RARELY MANIFESTS BEFORE THE THIRD DECADE IN MEN.
CC       IN WOMEN, IT IS USUALLY EXPRESSED ONLY AFTER THE MENOPAUSE. THE
CC       VAST MAJORITY OF FD PATIENTS ARE HOMOZYGOUS FOR APOE2. FD HAS ALSO
CC       BEEN OBSERVED IN INDIVIDUALS HETEROZYGOUS FOR RARE APOE VARIANTS,
CC       IN THESE CASES FD IS MORE SEVERE. THE INFLUENCE OF APOE ON LIPID
CC       LEVELS IS OFTEN SUGGESTED TO HAVE MAJOR IMPLICATIONS FOR THE RISK
CC       OF CORONARY ARTERY DISEASE (CAD). INDIVIDUALS CARRYING THE COMMON
CC       APOE4 VARIANT ARE AT HIGHER RISK OF CAD.
CC   -!- SIMILARITY: BELONGS TO THE APOA1 / APOA4 / APOE FAMILY.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=ApoE entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/apoe.htm".
DR   EMBL; M12529; AAB59518.1; -.
DR   EMBL; K00396; AAB59546.1; -.
DR   EMBL; M10065; AAB59397.1; -.
DR   EMBL; AF050154; AAD02505.1; -.
DR   PIR; A03093; LPHUE.
DR   PIR; JS0084; JS0084.
DR   PDB; 1LE2; 15-OCT-92.
DR   PDB; 1LE4; 15-OCT-92.
DR   PDB; 1LPE; 15-OCT-92.
DR   PDB; 1NFN; 27-JAN-97.
DR   PDB; 1NFO; 27-JAN-97.
DR   PDB; 1OEF; 07-DEC-96.
DR   PDB; 1OEG; 07-DEC-96.
DR   PDB; 1BZ4; 11-NOV-98.
DR   SWISS-2DPAGE; P02649; HUMAN.
DR   MIM; 107741; -.
DR   InterPro; IPR000074; -.
DR   Pfam; PF01442; Apolipoprotein; 1.
KW   Glycoprotein; Plasma; Lipid transport; HDL; VLDL; Chylomicron;
KW   Heparin-binding; Repeat; Signal; 3D-structure; Disease mutation;
KW   Polymorphism.
FT   SIGNAL        1     18
FT   CHAIN        19    317       APOLIPOPROTEIN E.
FT   DOMAIN      158    168       LDL RECEPTOR BINDING (POTENTIAL).
FT   DOMAIN      162    165       HEPARIN-BINDING.
FT   DOMAIN      229    236       HEPARIN-BINDING.
FT   DOMAIN       80    255       8 X 22 AA APPROXIMATE TANDEM REPEATS.
FT   REPEAT       80    101       1.
FT   REPEAT      102    123       2.
FT   REPEAT      124    145       3.
FT   REPEAT      146    167       4.
FT   REPEAT      168    189       5.
FT   REPEAT      190    211       6.
FT   REPEAT      212    233       7.
FT   REPEAT      234    255       8.
FT   CARBOHYD    212    212       O-LINKED (GALNAC...).
FT   VARIANT      21     21       E -> K (IN E5-TYPE).
FT                                /FTId=VAR_000645.
FT   VARIANT      31     31       E -> K (IN E5-TYPE AND E4 PHILADELPHIA).
FT                                /FTId=VAR_000646.
FT   VARIANT      46     46       L -> P (IN E4 FREIBURG).
FT                                /FTId=VAR_000647.
FT   VARIANT      60     60       T -> A (IN E3 FREIBURG).
FT                                /FTId=VAR_000648.
FT   VARIANT      99     99       Q -> K (IN E5 FRANKFURT).
FT                                /FTId=VAR_000649.
FT   VARIANT     102    102       P -> R (IN E5-TYPE; NO HYPERLIPIDEMIA).
FT                                /FTId=VAR_000650.
FT   VARIANT     117    117       A -> T (IN E3*).
FT                                /FTId=VAR_000651.
… several variants omitted …
FT   VARIANT     292    292       R -> H (IN E4 P.D.).
FT                                /FTId=VAR_000671.
FT   VARIANT     314    314       S -> R (IN E4 H.G.).
FT                                /FTId=VAR_000672.
FT   HELIX        43     59
FT   TURN         60     60
FT   HELIX        63     70
FT   HELIX        73     96
FT   TURN         97     98
FT   HELIX       105    141
FT   TURN        142    145
FT   HELIX       149    180
FT   TURN        181    182
SQ   SEQUENCE   317 AA;  36154 MW;  91AFC04210A30689 CRC64;
     MKVLWAALLV TFLAGCQAKV EQAVETEPEP ELRQQTEWQS GQRWELALGR FWDYLRWVQT
     LSEQVQEELL SSQVTQELRA LMDETMKELK AYKSELEEQL TPVAEETRAR LSKELQAAQA
     RLGADMEDVC GRLVQYRGEV QAMLGQSTEE LRVRLASHLR KLRKRLLRDA DDLQKRLAVY
     QAGAREGAER GLSAIRERLG PLVEQGRVRA ATVGSLAGQP LQERAQAWGE RLRARMEEMG
     SRTRDRLDEV KEQVAEVRAK LEEQAQQIRL QAEAFQARLK SWFEPLVEDM QRQWAGLVEK
     VQAAVGTSAA PVPSDNH
//

Figure 1. SWISS-PROT entry P02649, Human apolipoprotein E precursor (Apo-E).

Figure 2. SWISS-PROT and cross-references to other databases. The different types of data repositories are shown to which SWISS-PROT has established links in the description (DE) line, reference cross-reference (RX) line or database cross-reference (DR) lines:
[1] DNA sequence: DDBJ (Tateno et al., 2000); EMBL (Stoesser et al., 2001); GenBank (Benson et al., 2000).
[2] Genomics (species specific): DictyDb (Smith et al., 1997); EcoGene (Rudd, 2000); FlyBase (The FlyBase consortium, 1999); GeneCards (Rebhan et al., 1998); GeneCensus (Gerstein, 1998); HIV (Kuiken et al., 1999); HUGE (Kikuno et al., 2000); MaizeDB (Polacco et al., 1999); Mendel (Price et al., 2001); MGD (Blake et al., 2001); Micado (Perriere et al., 1999); NRSUB (Perriere et al., 1998); MIM (Wheeler et al., 2001); SGD (Ball et al., 2001); StyGene (Sanderson et al., 1995); SubtiList (Moszer, 1998); TIGR (Quackenbush et al., 2001); TubercuList (Cole, 1999); WormPep (Sonnhammer et al., 1997); YPD (Costanzo et al., 2001); ZFIN (Sprague et al., 2001).
[3] Proteomics: ECO2DBASE (VanBogelen et al., 1999); HSC-2DPAGE (Evans et al., 1997); MAIZE-2DPAGE (Touzet et al., 1996); SWISS-2DPAGE (Hoogland et al., 2000); Aarhus/Ghent-2DPAGE (Celis et al., 1998); YEPD (Latter et al., 1995).
[4] 3D structure: HSSP (Dodge et al., 1998); PDB (Bhat et al., 2001); PRESAGE (Brenner et al., 1999); SWISS-3DIMAGE (Peitsch et al., 1995).
[5] Post-translational modification: CarbBank (Doubet et al., 1989); GlycoSuite (Cooper et al., 2001).
[6] Protein family and domain: BLOCKS (Henikoff et al., 2000); DOMO (Gracy et al., 1998); InterPro (Apweiler et al., 2000); Pfam (Bateman et al., 2000); PRINTS (Attwood et al., 2000); ProDom (Corpet et al., 2000); PROSITE (Hofmann et al., 1999); ProtoMap (Yona et al., 2000).
[7] Specific protein families: GCRDb (Kolakowski, 1994); GPCRDB (Horn et al., 2001); IMGT (Lefranc, 2001); MEROPS (Rawlings et al., 2000); NucleaRDB (Horn et al., 2001); REBASE (Roberts et al., 2001).
[8] Protein-Protein Interaction: DIP (Xenarios et al., 2001).
[9] Enzymes and metabolic pathways: EcoCyc (Karp et al., 2000); ENZYME (Bairoch, 2000); MEROPS (Rawlings et al., 2000); REBASE (Roberts et al., 2001).
[10] Literature: PubMed, Medline (Wheeler et al., 2001).

database to include cross-references in its entries, long before the advent of the World Wide Web, which made navigation between data resources distributed all over the planet become second nature to all its users. There are five different types of cross-references available in SWISS-PROT: explicit and implicit cross-references in the DR lines, URL addresses under the comment (CC) topic “DATABASE”, and cross-references departing from certain key types in the feature table (FT). Finally, the Medline/PubMed (Wheeler et al., 2001) identifiers of literature references are stored in RX (Reference Cross(X)reference) lines and thus allow direct access to these literature databases. There are a number of other annotation items in SWISS-PROT that might also be termed cross-references and that are, in the World Wide Web version, enhanced with active hypertext links, namely scientific journal references (RL lines), taxonomy identifiers (OX lines) or enzyme classification numbers (DE lines). These different types of cross-references will be described in more detail in subsequent subsections.

DR Lines

The DR (Database cross-Reference) lines are used as pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT (see Figure 2). The full list of all databases to which SWISS-PROT is cross-referenced can be found in the document file dbxref.txt (http://www.expasy.ch/cgi-bin/lists?dbxref.txt). For example, for a sequence translated from a nucleotide sequence there will be one or more DR lines pointing to the relevant entry or entries in the EMBL/GenBank/DDBJ database (Stoesser et al., 2001; Benson et al., 2000; Tateno et al., 2000), which correspond to the DNA or RNA sequence(s) from which it was translated. If the X-ray crystallographic atomic coordinates of a sequence are stored in the Protein Data Bank (PDB) (Bhat et al., 2001), there will be one or more DR lines pointing to the corresponding entry or entries in that database.

In addition to cross-references provided by SWISS-PROT itself, SWISS-PROT also plays an important role for federated 2D-PAGE databases (Appel et al., 1996), which achieve much of the integration of data located and maintained at different sites through SWISS-PROT as their main index. We will explain this concept in a later subsection.

Explicit and Implicit Links

The database cross-references in DR (Database cross-Reference) lines available from the Web version of SWISS-PROT on ExPASy (example: http://www.expasy.ch/cgi-bin/niceprot.pl?P00750 or http://www.expasy.ch/cgi-bin/get-sprot-entry?P00750) include both explicit and implicit cross-references.

Explicit Links

Typically, a SWISS-PROT entry will have cross-references to its parent DNA sequence(s), to a genomic database (MIM (Wheeler et al., 2001), MGD (Blake et al., 2001), FlyBase (The FlyBase consortium, 1999), SGD (Ball et al., 2001), etc.), to information detailing its three-dimensional (3D) structure (PDB), etc. All these cross-references are stored in the SWISS-PROT flat file, generally in the form “DR database name; primary identifier; secondary identifier.” and are termed “explicit”. The primary identifier is an unambiguous pointer to the information entry in the database to which reference is made; for most databases, this corresponds to the (unique) accession number of the remote entry. The secondary identifier is generally used to complement the information given by the first identifier. Examples of secondary identifiers are entry names or release numbers. The SWISS-PROT user manual (http://www.expasy.ch/txt/userman.txt) provides a detailed explanation of primary and secondary identifiers for each of the referenced databases. Databases cross-referenced via explicit links have their own system of unique identifiers, which distinguishes them from the resources referenced via implicit links, as explained in the following subsection.

SWISS-PROT release 39.14 of 21/02/2001 consists of 93,407 protein sequence entries, which contain 564,764 explicit DR lines. Table 1 lists the 34 databases referenced in this manner, sorted by decreasing total number of SWISS-PROT entries linked to each of these databases. The absolute numbers shown in this table will of course already be obsolete at the time of publication, as Table 1 is merely a snapshot of release 39.14; however, it is important to note that each SWISS-PROT entry contains an average of 6 explicit cross-references. It is further noteworthy that those entries with cross-references to EMBL, i.e. those derived from nucleic acid sequences, contain an average of 1.81 DR EMBL lines. This illustrates the strong emphasis SWISS-PROT places on reducing redundancy and merging entries describing the same protein. TrEMBL (Bairoch et al., 2000) is a computer-annotated supplement to SWISS-PROT which, by definition, lacks much of the high quality annotation. However, as far as cross-references are concerned, many of the above-mentioned principles for SWISS-PROT also apply to TrEMBL. Numerous types of explicit cross-references can indeed be built automatically (often in collaboration with the database to which the link is established), and a large number of DR lines are systematically added and kept up to date in TrEMBL. These databases are highlighted in grey in Table 1.

Table 1. Databases explicitly referenced in SWISS-PROT DR lines. Abbreviations used: Database identifier: short identifier as used in SWISS-PROT DR lines; No. entries: total number of SWISS-PROT entries in release 39.14 with cross-references to this database; No. DR lines: total number of DR lines linking to this database in SWISS-PROT release 39.14. The full names corresponding to the database identifiers used here can be extracted from the reference list at the end of this article, and are listed in the SWISS-PROT document http://www.expasy.ch/cgi-bin/lists?dbxref.txt. Databases cross-referenced in TrEMBL are highlighted in grey.

Database identifier     No. entries   No. DR lines
EMBL                          87502         158541
InterPro                      66712          97619
Pfam                          64205          76620
PROSITE                       49062          75151
HSSP                          24705          24705
PRINTS                        24141          30638
TIGR                           6281           6301
MIM                            5424           6046
SGD                            4799           4847
EcoGene                        4046           4048
MGD                            3917           3928
PDB                            2971          10073
Mendel                         2826           2915
SubtiList                      2207           2208
MEROPS                         2110           2199
WormPep                        2005           2039
FlyBase                        1785           1833
TubercuList                    1235           1260
GCRDb                           972           1661
TRANSFAC                        970           1052
StyGene                         754            755
SWISS-2DPAGE                    733            734
MaizeDB                         397            401
HIV                             354            370
REBASE                          306            308
DictyDb                         303            306
ECO2DBASE                       299            351
GlycoSuiteDB                    198            198
ZFIN                            138            138
YEPD                            120            129
Aarhus/Ghent 2DPAGE              98            128
HSC-2DPAGE                       85             85
MAIZE-2DPAGE                     39             39
CarbBank                         21             41

Implicit Links

The ExPASy server further enhances the database interoperability offered by the explicit links by automatically adding, where appropriate, some so-called “implicit” links. As opposed to databases referenced via explicit links, the data collections linked in this manner usually do not have their own system of unique identifiers; however, they can be referenced via identifiers such as SWISS-PROT accession numbers, gene names, EMBL accession numbers, etc. Two broad categories of implicit links exist:

a) Various databases have been developed in the last ten years that are completely based on SWISS-PROT (and sometimes also TrEMBL) and offer a specific analytical view of the database. For example, the ProDom (Corpet et al., 2000) and DOMO (Gracy et al., 1998) databases provide automatically derived domain views of each protein in SWISS-PROT; the ProtoMap (Yona et al., 2000) database is a hierarchical classification of all SWISS-PROT proteins. In such cases, it is straightforward to add implicit links. There should be, for each SWISS-PROT entry, a corresponding entry in these derived databases, and such an entry can be accessed using the primary accession number of the SWISS-PROT entry.

b) There are specialized databases, which are supposed to share some form of “identifiers” with SWISS-PROT. A typical example is GeneCards (Rebhan et al., 1998), a database containing information on human genes. It can be accessed using the HUGO (Human Genome Organization) approved symbol for a relevant gene. Because SWISS-PROT also uses, as the first name listed on the GN line, the HUGO approved gene symbol, it is
possible to automatically generate links between SWISS-PROT and GeneCards.

In both cases, no extra DR line for such data collections is added to the SWISS-PROT flat file. Indeed, such DR lines would take up “space” without really adding information: knowing that information related to every human SWISS-PROT entry can be found in GeneCards, and that this information is accessible by searching GeneCards for the SWISS-PROT accession number, an explicit line “DR GeneCards; P00001.” is redundant and will not provide the user with pertinent new information. In the Web version however, a hypertext link is useful and allows the user to navigate directly to the related information provided by the remote data collection.

Whilst the distinction between explicit and implicit links may not seem very important to a user querying SWISS-PROT through the World Wide Web, as both types of links are clickable, there are two noteworthy issues:

Firstly, if you view or print an entry from the Web server, it will contain lines that do not exist in the version distributed with and accessible through various software packages or from other Web servers; ExPASy and the EBI servers are the only ones to provide most of the implicit links described above. Secondly, such automatically generated links can fail in some cases. For example, a new SWISS-PROT entry may not yet have a corresponding entry in a derived database. Or, to take the example of GeneCards, it could happen that the gene symbol has not yet been “synchronized” (either GeneCards has updated a gene name before SWISS-PROT or the reverse).

A growing number of databases (currently 19) can be accessed via implicit links in the SWISS-PROT version displayed on the ExPASy server. The documentation file dbxref.txt (http://www.expasy.ch/cgi-bin/lists?dbxref.txt) lists URLs for all of these databases and details the rules and criteria used for the construction of the implicit links.

In SWISS-PROT release 39.14, 131 different resources were referenced through the CC DATABASE topic (see “CC Lines”), from 440 CC DATABASE topics in 411 entries.

FT Lines

Certain specialized protein-related databases have entries that do not correspond to the complete protein sequence described by a SWISS-PROT entry, but rather to a particular sub-sequence or even just one amino acid. Examples are databases specializing in certain types of post-translational modifications of proteins, or in mutations. Since the feature table (FT) is the section of SWISS-PROT that contains position-specific annotation, it makes sense to construct links directly from relevant feature table lines to the corresponding information in other databases. The feature identifiers (FTId) in FT VARIANT lines of human sequence entries, for example, make it possible to refer to a sequence variation in a unique and stable manner, and serve as anchors for specifically directed links. A federated single human mutation database (HmutDB; http://www2.ebi.ac.uk/mutaions/central/proposal.html) has been proposed, and the complete set of all FT VARIANT lines has been indexed for SRS at EBI (http://srs.ebi.ac.uk/), under the name SWISSCHANGE. The database SWISSCHANGE can be queried by SWISS-PROT FTIds.

In the future, the same principle will be used to further enhance the links to GlycoSuiteDB (Cooper et al., 2001). GlycoSuiteDB is an annotated database of glycan structures. For human Alpha-2-HS-glycoprotein (P02765) for example, GlycoSuiteDB provides detailed information about two different oligosaccharide structures attached to five different sites within the sequence. In addition to providing a global link to GlycoSuiteDB by creating an explicit DR line in SWISS-PROT entry P02765, we will create unique feature identifiers for each of the 5 FT CARBOHYD lines, which will allow direct access to the corresponding glycan structures.
CC Lines

While the DR lines provide access to general databases relating to a large number of SWISS-PROT entries, there are hundreds of resources, databases and data collections available on the World Wide Web that specialize in one specific protein or protein family, or provide detailed information about mutations or polymorphisms. A comprehensive and regularly updated list of such resources is available at http://www.expasy.ch/alinks.html. These data collections are usually extremely heterogeneous, and not necessarily very structured, and are generally accessed via one central URL rather than by accession number. Cross-references to such resources are provided in SWISS-PROT CC (comments) lines, under the topic ‘DATABASE’. The syntax of the ‘DATABASE’ topic is:

CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].

where ‘NAME’ is the name of the database; ‘NOTE’ (optional) is a free text note; ‘WWW’ (optional) is the WWW address (URL) of the database; ‘FTP’ (optional) is the anonymous FTP address (including the directory name) where the database file(s) are stored. An example of such a cross-reference can be seen in Figure 1.

RX Lines

For each published reference cited in SWISS-PROT, for which an entry in the literature databases Medline/PubMed exists, the reference block of the SWISS-PROT entry contains an RX line providing the Medline and PubMed identifiers. This allows quick and direct access to the abstract of the publication.

Other Cross-References

When looking at a SWISS-PROT entry on ExPASy (e.g. http://www.expasy.ch/cgi-bin/get-sprot-entry?P00750), one immediately notices the large number of hypertext links (clickable text portions displayed in blue and underlined, unless the Web browser is configured to display them differently). Many of them are links to resources local to ExPASy, but there are a number of annotation items that might, even if they do not belong to any of the cross-reference types described earlier, also be termed cross-references. Two examples are the taxonomy identifier (Tax_ID), the unique identifier of each organism in the NCBI taxonomy classification (Wheeler et al., 2001) (in the OX line of each SWISS-PROT entry), and the Enzyme Classification (EC) numbers (Bairoch, 2000) which can be found in SWISS-PROT description (DE) lines of relevant protein sequence entries.

Another example is the coordinates (journal name, volume and page numbers and year of publication) for scientific journal references, which can be categorized as both explicit and implicit links. They are explicit because the RL line of the SWISS-PROT flat file allows the user to find the published article, either the paper copy at a library, or the electronic version through the journal’s web site. They are implicit because the ExPASy Web version of SWISS-PROT adds some extra value to this hard-coded reference, by providing a hypertext link to the publisher site, using some additional information such as the date or volume number of the first issues available on-line. We plan to further enhance these links in the future, by encouraging publishers to link back to SWISS-PROT (and thus establish bi-directional links), and by creating more stable links using ISSN numbers and Digital Object Identifiers (DOI; see http://www.doi.org/).

SWISS-PROT as a Common Index for Federated Databases

With data exchange becoming more and more convenient, scientists can easily collaborate via the Internet, which can often result in very powerful projects combining the expertise of all participants. Instead of creating one central database for a specific topic, several independent, separately and heterogeneously maintained databases can be joined using the concept of federated databases. This principle has been successfully applied to the field of proteomics, where currently 15 federated 2D-PAGE databases exist (http://www.expasy.ch/ch2d/2d-index.html). These databases agreed to comply with a number of rules (Appel et al., 1996), which are mainly based on cross-references: among other requirements, the database must be linked to other databases through active hypertext cross-references, which link together all related databases and combine them into one large virtual database. In addition to these hypertext links between federated databases, a main index has to be supplied that provides a means of querying all databases through one unique entry point. Bi-directional cross-references must exist between the main index and the other databases. SWISS-PROT currently acts as this main index.

This concept of database federation could also be applied to other fields of specialisation. One might imagine federated databases for protein post-translational modifications (where GlycoSuiteDB could be considered as the first such federated database), or for polymorphisms and mutations. Both these subjects are extremely complex, and require expertise that is probably best shared between centers of competence around the planet rather than centralized in one single database. The possibility of creating links from such specific entities as SWISS-PROT feature table lines opens a large potential of database interoperability where SWISS-PROT can serve as a common index. We highly encourage any interested database provider, in particular those specializing in post-translational modifications, to collaborate with us in order to provide users with an even more comprehensive view of all data available for their protein of interest.

Conclusions

In this paper we illustrated our efforts to integrate biomolecular knowledge in our protein database SWISS-PROT. While aiming at providing as much annotation information on the protein as possible, we place a strong emphasis on integration with other biomolecular data repositories. Needless to say, in the era of proteomics we are already flooded with an increasing amount of data resulting from the analysis of the complex readout of genomes. Therefore we predict that there will be a lot more specialized data repositories established in the future, and we hope to collaborate with these information resources. Cross-references in SWISS-PROT allow users with a specific interest in a particular biomolecular field to access information gathered by experts in their domain, whilst individual SWISS-PROT entries are not overloaded with specialized information and remain easily comprehensible to the general interest user.

References

Appel, R.D., Bairoch, A., Sanchez, J.-C., Vargas, J.R., Golaz, O., Pasquali, C. and Hochstrasser, D.F. 1996. Federated 2-DE database: a simple means of publishing 2-DE data. Electrophoresis 17: 540-546.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J. and Zdobnov, E.M. 2000. InterPro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16: 1145-1150.
Attwood, T.K., Croning, M.D.R., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P., Selley, J.N. and Wright, W. 2000. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28: 225-227.
Bairoch, A. 2000. The ENZYME database in 2000. Nucleic Acids Res. 28: 304-305.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48.
Ball, C.A., Jin, H., Sherlock, G., Weng, S., Matese, J.C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Schroeder, M., Botstein, D. and Cherry, J.M. 2001. Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 29: 80-81.
Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L. and Sonnhammer, E.L. 2000. The Pfam protein families database. Nucleic Acids Res. 28: 263-266.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A. and Wheeler, D.L. 2000. GenBank. Nucleic Acids Res. 28: 15-18.
Berners-Lee, T. 1999. Weaving the Web. Harper, San Francisco.
Bhat, T.N., Bourne, P., Feng, Z., Gilliland, G., Jain, S., Ravichandran, V., Schneider, B., Schneider, K., Thanki, N., Weissig, H., Westbrook, J. and Berman, H.M. 2001. The PDB data uniformity project. Nucleic Acids Res. 29: 214-218.
Blake, J.A., Eppig, J.T., Richardson, J.E., Bult, C.J. and Kadin, J.A. 2001. The Mouse Genome Database (MGD): integration nexus for the laboratory mouse. Nucleic Acids Res. 29: 91-94.
Brenner, S.E., Barken, D. and Levitt, M. 1999. The PRESAGE database for structural genomics. Nucleic Acids Res. 27: 251-253.
Celis, J.E., Ostergaard, M., Jensen, N.A., Gromova, I., Rasmussen, H.H. and Gromov, P.
and Pellegrini-Toole, A. 2000. The EcoCyc and MetaCyc databases. Nucleic Acids Res. 28: 56-59.
Kikuno, R., Nagase, T., Suyama, M., Waki, M., Hirosawa, M. and Ohara, O. 2000. HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project. Nucleic Acids Res. 28: 331-332.
Kolakowski, L.F. 1994. GCRDb: a G-protein-coupled receptor database. Receptors Channels 2: 1-7.
Kuiken, C.L., Foley, B., Hahn, B., Korber, B., McCutchan, F., Marx, P.A., Mellors, J.W., Mullins, J.I., Sodroski, J.
1998. Human and and Wolinksy, S. 1999. Human Retroviruses and AIDS mouse proteomic databases: novel resources in the 1999: A Compilation and Analysis of Nucleic Acid and protein universe. FEBS Lett. 430: 64-72. Amino Acid Sequences. In: Theoretical Biology and Cole, S.T. 1999. Learning from the genome sequence of Biophysics Group, Los Alamos National Laboratory, Los Mycobacterium tuberculosis H37Rv. FEBS Lett. 452: 7- Alamos, NM. 10. Latter, G.I., Boutell, T., Monardo, P.J., Kobayashi, R., Cooper, C.A., Harrison, M.J., Wilkins, M.R. and Packer, Futcher, B., Mclaughlin, C.S.and Garrels, J.I. 1995. A N.H. 2001. GlycoSuiteDB: a new curated relational Saccharomyces cerevisiae Internet protein resource now database of glycoprotein glycan structures and their available. Electrophoresis. 16: 1170-1174. biological sources. Nucleic Acids Res. 29: 332-335. Lefranc, M.P. 2001. IMGT, the international Corpet, F., Servant, F., Gouzy, J. and Kahn, D. 2000. ImMunoGeneTics database. Nucleic Acids Res. 29: 207- ProDom and ProDom-CG: tools for protein domain 209. analysis and whole genome comparisons. Nucleic Acids Moszer, I. 1998. The complete genome of Bacillus subtilis: Res. 28: 267-269. from sequence annotation to data management and Costanzo, M.C., Crawford, M.E., Hirschman, J.E., Kranz, analysis. FEBS Lett. 430: 28-36. J.E., Olsen, P., Robertson, L.S., Skrzypek, M.S., Braun, Perriere, G., Bessieres, P. and Labedan, B. 1999. The B.R., Hopkins, K.L., Kondu, P., Lengieza, C., Lew-Smith, Enhanced Microbial Genomes Library. Nucleic Acids Res. J.E., Tillberg, M. and Garrels, J.I. 2001. YPD, PombePD 27: 63-65. and WormPD: model organism volumes of the Perriere, G., Gouy, M. and Gojobori, T. 1998. The non- BioKnowledge library, an integrated resource for protein redundant Bacillus subtilis (NRSub) database: update information. Nucleic Acids Res. 29: 75-79. 1998. Nucleic Acids Res. 26: 60-62. Dodge, C., Schneider, R. and Sander, C. 1998. 
The HSSP Peitsch, M.C., Wells, T.N., Stampf, D.R. and Sussman, J.L. database of protein structure-sequence alignments and 1995. The Swiss-3DImage collection and PDB-Browser family profiles. Nucleic Acids Res. 26: 313-315. on the World-Wide Web. Trends Biochem. Sci. 20: 82- Doubet, S., Bock, K., Smith, D., Darvill, A. and Albersheim, 84. P. 1989. The Complex Carbohydrate Structure Database. Polacco, M., Chen, S., Coe, E., Hancock, D.C., Kross, H., Trends Biochem. Sci. 14: 475-477. Schroeder, S. and Vargo, C. 1999. MaizeDB: integrated Evans, G., Wheeler, C.H., Corbett, J.M. and Dunn, M.J. maize genome resource. Community curation, database 1997. Construction of HSC-2DPAGE: a two-dimensional interoperability, and comparative map displays. Maize gel electrophoresis database of heart proteins. Genetics Conference Abstracts 41. Electrophoresis. 18: 471-479. Price, C.A. and Reardon, E.M. 2001. Mendel, a database Gerstein, M. 1998. How representative are the known of nomenclature for sequenced plant genes. Nucleic Acids structures of the proteins in a complete genome? A Res. 29: 118-119. comprehensive structural census. Fold Des. 3: 497-512. Quackenbush, J., Cho, J., Lee, D., Liang, F., Holt, I., Gracy, J. and Argos, P. 1998. DOMO: a new database of Karamycheva, S., Parvizi, B., Pertea, G., Sultana, R. and aligned protein domains. Trends Biochem. Sci. 23: 495- White, J. 2001. The TIGR Gene Indices: analysis of gene 497. transcript sequences in highly sampled eukaryotic Henikoff, J.G., Greene, E.A., Pietrokovski, S. and Henikoff, species. Nucleic Acids Res. 29: 159-164. S. 2000. Increased coverage of protein families with the Rawlings, N.D. and Barrett, A.J. 2000. MEROPS: the blocks database servers. Nucleic Acids Res. 28: 228-230. peptidase database. Nucleic Acids Res. 28: 323-325. Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. 1999. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. and Lancet, D. The PROSITE database, its status in 1999. Nucleic Acids 1998. 
GeneCards: a novel functional genomics Res. 27: 215-219. compendium with automated data mining and query Hoogland, C., Sanchez, J.-C., Tonella, L., Binz, P.-A., reformulation support. Bioinformatics 14: 656-664. Bairoch, A., Hochstrasser, D.F. and Appel, R.D. 2000. Roberts, R.J. and Macelis, D. 2001. REBASE—restriction The 1999 SWISS-2DPAGE database update. Nucleic enzymes and methylases. Nucleic Acids Res. 29: 268- Acids Res. 28: 286-288. 269. Horn, F., Vriend, G. and Cohen, F.E. 2001. Collecting and Rudd, K.E. 2000. EcoGene: a genome sequence database harvesting biological data: the GPCRDB and NucleaRDB for Escherichia coli K-12. Nucleic Acids Res. 28: 60-64. information systems. Nucleic Acids Res. 29: 346-349. Sanderson, K.E., Hessel, A. and Rudd, K.E. 1995. Genetic Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Paley, S.M. map of Salmonella typhimurium, edition VIII. Microbiol. Connecting Biomolecular Knowledge in SWISS-PROT 55 Rev. 59: 241-303. Smith, D.W. and Loomis, W.F. 1997. DictyDB - A Genomic Database for Dictyostelium discoideum. In: Dictyostelium. A Model System for Cell and Developmental Biology. Y. Maeda, K. Inouyea and I. Takeuchi, eds. Universal Academic Press, Inc. - Tokyo, Japan. p. 471-477. Sonnhammer, E.L. and Durbin, R. 1997. Analysis of protein domain families in Caenorhabditis elegans. Genomics 46: 200-216. Sprague, J., Doerry, E., Douglas, S. and Westerfield, M. 2001. The Zebrafish Information Network (ZFIN): a resource for genetic, genomic and developmental research. Nucleic Acids Res. 29: 87-90. Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Lombard, V., Lopez, R., Parkinson, H., Redaschi, N., Sterk, P., Stoehr, P. and Tuli, M.A. 2001. The EMBL nucleotide sequence database. Nucleic Acids Res. 29: 17-21. Tateno, Y., Miyazaki, S., Ota, M., Sugawara, H. and Gojobori, T. 2000. DNA Data Bank of Japan (DDBJ) in collaboration with mass sequencing teams. Nucleic Acids Res. 28: 24-26. 
The FlyBase Consortium. 1999. The FlyBase database of the Drosophila Genome Projects and community literature. Nucleic Acids Res. 27: 85-88. Touzet, P., Riccardi, F., Morin, C., Damerval, C., Huet, J.- C., Pernollet, J.-C., Zivy, M. and de Vienne, D. 1996. The maize two dimensional gel protein database: towards an integrated genome analysis program. Theor. Appl. Genet. 93: 997-1005. VanBogelen, R.A., Schiller, E.E., Thomas, J.D. and Neidhardt, F.C. 1999. Diagnosis of cellular states of microbial organisms using proteomics. Electrophoresis 20: 2149-2159. Wheeler, D.L., Church, D.M., Lash, A.E., Leipe, D.D., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Tatusova, T.A., Wagner, L. and Rapp, B.A. 2001. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 29: 11-16. Xenarios, I., Fernandez, E., Salwinski, L., Duan, X.J., Thompson, M.J., Marcotte, E.M. and Eisenberg, D. 2001. DIP: The Database of Interacting Proteins: 2001 update. Nucleic Acids Res. 29: 239-241. Yona, G., Linial, N. and Linial, M. 2000. ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res. 28: 49- 55. Further Reading Caister Academic Press is a leading academic publisher of advanced texts in microbiology, molecular biology and medical research. 
Current Issues in Molecular Biology – Multidisciplinary Digital Publishing Institute
Published: Jul 1, 2001