The COG database: new developments in phylogenetic classification of proteins from complete genomes

Roman L. Tatusov; Darren A. Natale; Igor V. Garkavtsev; Tatiana A. Tatusova; Uma T. Shankavaram; Bachoti S. Rao; Boris Kiryutin; Michael Y. Galperin; Natalie D. Fedorova; Eugene V. Koonin

doi:10.1093/nar/29.1.22

Nucleic Acids Research, 2001, Vol. 29, No. 1, 22–28

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Received October 2, 2000; Accepted October 11, 2000

Correspondence should be addressed to Eugene V. Koonin. Tel: +1 301 435 5913; Fax: +1 301 480 9241

ABSTRACT

The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt on a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis.

INTRODUCTION

The database of Clusters of Orthologous Groups of proteins (COGs) has been incepted as a phylogenetic classification of proteins from complete genomes (1). Each COG includes proteins that are thought to be orthologous, i.e. connected through vertical evolutionary descent (2). Orthology may involve not only one-to-one, but also, in cases of lineage-specific gene duplications, one-to-many and many-to-many relationships (hence Orthologous Groups of proteins). The purpose of the COGs database is to serve as a platform for functional annotation of newly sequenced genomes and for studies on genome evolution. To facilitate functional studies, the COGs have been classified into 17 broad functional categories, including a class for which only a general functional prediction, usually that of biochemical activity, was feasible and a class of uncharacterized COGs. Additionally, some of the COGs with known functions are organized to represent specific cellular systems and biochemical pathways. The database is accompanied by the COGNITOR program, which assigns new proteins, typically from newly sequenced genomes, to pre-existing COGs. Here we describe the new developments in the COGs database in the year 2000, which included both the quantitative update through addition of new genomes and development of new functionalities associated with the database.

THE CURRENT STATUS OF THE COGS: NEW GENOMES

Since the second release of the COG database in January 2000 (3), nine new genomes have been added to the database using the COGNITOR program with subsequent manual validation to identify new members of pre-existing COGs and previously described procedures for the construction of new COGs. The additions included the first sequenced genome of a crenarchaeon (representative of the second major division of the archaea), Aeropyrum pernix; a fifth representative of the Euryarchaea, Pyrococcus abyssi; and seven bacterial genomes, including those from unusual organisms such as the extremely radioresistant Deinococcus radiodurans (Table 1). The previously described trend held with the new genomes in that 60–80% of the proteins from each of the prokaryotic genomes could be included in COGs (Table 1).

Table 1. Representation of genomes in the COGs

Species                                  Total no. of        No. of proteins     Proteins in
                                         encoded proteins    assigned to COGs    COGs (%)
Archaea
  Archaeoglobus fulgidus                 2420                1817                75
  Methanococcus jannaschii               1786                1301                74
  Methanobacterium thermoautotrophicum   1873                1365                73
  P.abyssi                               1767                1430                81
  Pyrococcus horikoshii                  2080                1353                66
  A.pernix [a]                           2722                1157                43
Bacteria
  Aquifex aeolicus                       1560                1312                84
  Bacillus subtilis                      4118                2767                67
  Borrelia burgdorferi [b]               1637                693                 43
  Campylobacter jejuni                   1634                1282                78
  Chlamydia trachomatis                  895                 630                 71
  Chlamydia pneumoniae                   1053                646                 62
  D.radiodurans                          3194                2133                67
  Escherichia coli                       4285                3308                77
  Haemophilus influenzae                 1695                1497                88
  Helicobacter pylori                    1578                1070                68
  Mycobacterium tuberculosis             3924                2456                63
  Mycoplasma genitalium                  471                 374                 79
  Mycoplasma pneumoniae                  680                 419                 62
  Neisseria meningitidis                 2081                1446                70
  Pseudomonas aeruginosa                 5567                4166                75
  Rickettsia prowazekii                  836                 673                 81
  Synechocystis sp.                      3168                2048                65
  Thermotoga maritima                    1858                1497                81
  Treponema pallidum                     1036                705                 68
  Vibrio cholerae                        3828                2715                71
  Ureaplasma urealyticum                 613                 398                 64
  Xylella fastidiosa                     2766                1481                54
Eukaryotes
  S.cerevisiae                           5964                2158                36
Total                                    68 571              45 350              66

Newly added genomes are underlined.
[a] The low fraction of proteins assigned to COGs is probably due to over-prediction of protein-coding genes in the original genome annotation (see text and Table 2).
[b] The low fraction of proteins assigned to COGs is due to the fact that part of the genome consists of multiple plasmids that code for poorly conserved proteins.

The genome of the crenarchaeon A.pernix (4), which was of particular interest because this major evolutionary lineage had not been previously represented among completely sequenced genomes, was investigated in detail as a benchmark for annotation of newly sequenced genomes using the COG system (5). The COG analysis resulted in an ~50% increase in confident functional prediction for A.pernix genes compared to the original annotations. On the other hand, a significant fraction of open reading frames (ORFs), originally annotated as genes, did not show detectable similarity to any proteins in current databases, but overlapped with proteins included in the COGs, strongly suggesting that these ORFs were not real genes (Table 2). Thus the analysis of the genome of an organism that had no close relatives among other organisms with sequenced genomes appears to corroborate the effectiveness of the COG system as a genome annotation tool.

Table 2. Analysis of the predicted A.pernix proteins using the COG system

                                 Originally    Proteins assigned to COGs       Original ORFs        Predicted proteins
                                 predicted     Predicted     Function          overlapping with     after COG
                                 proteins      function      unknown           COG members          analysis
Number                           2722          833           315               849                  1843
% original gene set              100           31            12                32                   68
% gene set after COG analysis    146           45            17                NA                   100

NA, not applicable.

Given the accumulation of multiple, complete genome sequences, we were interested in the growth dynamics of the COG set with the increased number of included genomes. The growth curve was constructed by imitating the COG formation for each of the 10 random orders of genome inclusion (Fig. 1). For each number of species, the maximum, the minimum and the average number of COGs was determined. The minimal and the maximal curves define the area containing all possible growth curves (Fig. 1). The average curve approximates the expected dynamics of the COG growth. Given that the number of completely sequenced genomes is still relatively small and that some of them are closely related, it remains uncertain whether or not the number of COGs is starting to approach saturation, and if it is, what is the asymptotic value.

Figure 1. Growth dynamics of the COG set with the increase of number of included genomes. The circles show the sequence of genome inclusion according to the actual order of sequencing, and the smooth line shows the mean of 10 random permutations of the genome order. The colored area indicates the range between the maximal and minimal value for each point (number of genomes) in 10 random permutations.

ADDING PROTEINS FROM MULTICELLULAR EUKARYOTES TO PROKARYOTIC COGs

The current COG collection includes multiple bacterial and archaeal genomes and only one eukaryotic species, the yeast Saccharomyces cerevisiae. Incorporating the larger genomes of multicellular eukaryotes into the COG system is a challenging task due to the preponderance of multidomain proteins in these organisms. As a first step toward this goal, we sought to identify eukaryotic proteins that fit into already existing COGs, in other words, those eukaryotic proteins that have orthologs in at least two prokaryotic species. To this end, 19 895 protein sequences from the (nearly) complete genome of the nematode Caenorhabditis elegans (6) and 14 100 sequences from the genome of the fruit fly Drosophila melanogaster (7) were analyzed using the COGNITOR program, which assigns proteins to COGs on the basis of multiple genome-specific best hits and splits multidomain protein into individual domains if these show affinity with different COGs. After manual validation of the results, 20% of the D.melanogaster proteins and 14% of the C.elegans proteins were assigned to COGs; a significant number of proteins from each of the multicellular eukaryotes were included in COGs of each functional category, with the notable exception of 'Cell division and chromosome partitioning' and 'Cell motility and secretion', which consist primarily of prokaryote-specific proteins (Table 3). The COG analysis of the worm and fly proteins yielded numerous functional predictions, which have not been described previously (I.V.Garkavtsev and E.V.Koonin, unpublished observations). Eukaryotic proteins that have orthologs in prokaryotes belong to two major categories: (i) ancient proteins inherited from the last common ancestor of all extant life forms or at least the common ancestor of archaea and eukaryotes; (ii) proteins encoded by genes that have been horizontally transferred from organelles to the eukaryotic nucleus or otherwise acquired by eukaryotes from bacteria (8). Analysis of the phylogenetic patterns in the COGs may help distinguish between these two categories.

Table 3. Eukaryotic proteins in the COGs

Functional category                                    Eukaryotic proteins assigned to COGs
                                                       S.cerevisiae    C.elegans    D.melanogaster
Translation                                            276             221          270
Transcription                                          107             134          167
Replication and repair                                 165             186          159
Post-translational modification, chaperone functions   167             260          273
Cell division and chromosome partitioning              23              22           19
Cell motility and secretion                            10              17           14
Cell envelope biogenesis, outer membrane               29              62           47
Inorganic ion transport                                85              199          132
Signal transduction
Energy production and conversion                       116             138          183
Carbohydrate transport and metabolism                  176             295          300
Amino acid transport and metabolism                    180             193          222
Nucleotide transport and metabolism                    85              88           99
Coenzyme metabolism                                    86              63           69
Lipid metabolism                                       52              237          169
General function prediction only                       344             673          635
Function unknown                                       53              60           84
Total                                                  1954            2848         2842

After three distant eukaryotic genomes were included in the prokaryotic COGs, it was of interest to analyze their co-occurrence. As expected, the majority of COGs with eukaryotic members include all three genomes; at the same time, a considerable number of COGs include all possible pairs of eukaryotic genomes and each of the individual species (Table 4). These observations, which will be analyzed in detail elsewhere, support the major role of lineage-specific gene loss and horizontal gene transfer in eukaryotic evolution.

Table 4. Co-occurrence of the eukaryotic genomes in the COGs

Number of COGs    Eukaryotic species
578               C.elegans, D.melanogaster, S.cerevisiae
99                C.elegans, D.melanogaster
38                C.elegans, S.cerevisiae
46                C.elegans
77                D.melanogaster, S.cerevisiae
44                D.melanogaster
166               S.cerevisiae

DETECTING MISSED GENES

One of the features associated with the COG database is the analysis of phylogenetic patterns, i.e. the patterns of species that are represented or not represented in each of the COGs. Unexpected phylogenetic patterns, for example, those that contain all but one bacterial species or those that include only one of a pair of closely related species, may be due to omission of genes in genome annotations submitted to GenBank or to unusual evolutionary phenomena such as non-orthologous displacement of a nearly ubiquitous gene. Before considering the second hypothesis, the first one should be tested, and we undertook a systematic analysis of COGs with unexpected phylogenetic patterns in search of missing members (9). The nucleotide sequence of the genome in question was searched using the TNBLASTN program (10) and the sequences of members of the respective COGs as queries. As a result, missing genes coding for members of 48 COGs were identified (Table 5); most of the predicted new proteins are small, which explains why they have escaped the original genome annotations. Thus the COG system is instrumental in improving genome annotation not only with respect to functional predictions, but also for gene identification per se.

Table 5. Detection of missed proteins using phylogenetic pattern analysis

Species                  Number of previously     COGs including new proteins
                         undetected proteins
                         assigned to COGs
A.fulgidus               3                        1143, 1255, 1698
M.jannaschii             4                        0286, 0827, 1908, 1996
M.thermoautotrophicum    1                        2888
P.abyssi                 1                        2888
P.horikoshii             15                       1383, 1761, 1919, 1998, 2004, 2051, 2075, 2092, 2093, 2097, 2167, 2212, 2260, 2443, 2888
A.pernix                 19                       0640, 1522, 1605, 1694, 1848, 1858, 2002, 2118, 2260, 2443
A.aeolicus               6                        0254, 0255, 0690, 0858, 1828
B.subtilis               2                        1582, 1863
C.trachomatis            1                        1314
D.radiodurans            2                        1863, 2120
H.influenzae             1                        1826
H.pylori                 1                        0690
M.tuberculosis           1                        0458
M.genitalium             1                        0828
M.pneumoniae             3                        0816, 0828, 1546
T.maritima               2                        0230, 1886
T.pallidum               1                        0268

NEW FEATURES ASSOCIATED WITH THE COGS

Improvement of the COGNITOR program: statistical evaluation of the fit

The original COGNITOR program uses multiple genome-specific best hits (BeTs) as the only criterion for assigning new proteins to COGs. In the new version, we introduced an estimate of the probability that the query protein is assigned to the given COG by chance. Under the assumption of uniform distribution of hits to each genome in the COG database, the probability of one BeT into a particular COG is, simply, the fraction of the proteins from the specified genome that belongs to the COG:

$$f_{ij} = n_{ij} / N_i$$

where $n_{ij}$ is the number of proteins from species $i$ in COG $j$ and $N_i$ is the total number of proteins in species $i$. Then, the probability of exactly two BeTs into COG $j$ is given by:

$$p_{2j} = \frac{1}{2} \sum f_{ij} f_{kj} \prod_{l \neq i,\ l \neq k} (1 - f_{lj})$$

Similar expressions can be easily obtained for a different number of BeTs. For each COG, we can compute $p_{2j}$ and find the 'average' value of $F_j$ that satisfies the equation:

$$C(2,m)\, F_j^{2} (1 - F_j)^{m-2} = p_{2j}$$

where $m$ is the number of species in COG $j$. Using $F_j$ simplifies the calculation of the probability when the specified number of BeTs is large.

COG-Info pages

In order to increase the utility of the COG system for genome annotation, a web page that contains additional structural and functional information on the COG as a whole and individual members is now associated with each COG. These hyperlinked Info pages include: systematic classification of the COG members under the current classification systems for enzyme or transporters (if applicable); indications which COG members (if any) have been characterized genetically and biochemically; information on the domain architecture of the proteins comprising the COG and the three-dimensional structure of the domains if known or predictable; a succinct summary of the common structural and functional features of the COG members and peculiarities of individual members; key references (Fig. 2). The COG-Info pages are currently at different stages of construction.

Figure 2. An example of a COG-Info page.

Classification of genomes on the basis of co-occurrence in COGs using principal component analysis

The data on the co-occurrence of genomes in COGs was used as the input for classification by principal component analysis (PCA). Briefly, the presence or absence of a given species in each COG is converted into a 1/0 coordinate value in a multidimensional space where each dimension corresponds to a COG, which results in a geometric representation of all included species in the >2000-dimensional space. The PCA analysis is then used to choose the subspace of lower dimensionality for visual examination. The eigenvector decomposition yields the orthogonal courses in the space and the corresponding eigenvectors constitute the spread of the objects. The WWW interface provides tools for selection of the subspace, the species to view and the COGs to use for classification (Fig. 3A). Significantly different results were obtained when different functional categories of COGs were analyzed. Specifically, the combined categories of translation, transcription and replication showed a sharp separation between bacteria, archaea and eukaryotes, with representatives of each of these primary domains of life forming a tight cluster (Fig. 3B); the metabolic functions produced a more complex picture, with a separation of free-living and parasitic bacteria and grouping of yeast with the former (Fig. 3C).

Figure 3. Classification of genome by co-occurrence in COGs using PCA. (A) All COGs. (B) Translation, transcription and replication (functional categories J, K and L). (C) Metabolism (functional categories C, E, F, G, H and I).

Integration of COGs with the Genome Division of Entrez

The COGs are now integrated with the Genomes division of the Entrez system. From the COG pages, the proteins are linked to the Entrez genome view (the 'Genome' button) and to the protein neighbor view (the Blink button). Conversely, the Genomes division of Entrez (11) incorporates COG information in several displays. The COG information including the breakdown by the functional categories is presented for each genome, for example: http://www.ncbi.nlm.nih.gov:80/cgi-bin/Entrez/coxik?gi=131. The main page for each genome includes a (usually) circular genome map, with radial lines corresponding to genes color-coded according to the functional categories adopted in the COG system. Additionally, for all proteins that belong to COGs, the protein view is linked to the respective COG.

THE COG WORLDWIDE WEB SITE

The COG database is accessible at http://www.ncbi.nlm.nih.gov/COG. The site includes the following main features: complete list of all COGs hyperlinked to individual COG pages; COGs organized by functional category; COGs organized by functional complexes and pathways; an interactive matrix of co-occurrence of genomes in COGs; a phylogenetic pattern search tool; a principal component classification tool; COGNITOR; a COG Help page. Each of the individual COG pages is hyperlinked to: (i) pictorial representations of BLAST search outputs for each member of the COG, which also include links to the respective GenBank and Entrez-Genomes entries, (ii) a multiple alignment of the COG members produced automatically by using the ClustalW program, (iii) a COG-Info page (reached by clicking on the COG number). The supplement to the COGs, which shows proteins from C.elegans and D.melanogaster assigned to each COG is accessible at http://www.ncbi.nlm.nih.gov/COG/euk. The COG data set is also available by anonymous ftp at ftp://ncbi.nlm.nih.gov/pub/COG.

ACKNOWLEDGEMENTS

The authors are grateful to David Lipman for his critical contribution at the initial stage of the COG project and constant support and inspiration and to Vivek Anantharaman, L. Aravind, Kira Makarova, Igor Rogozin and Yuri Wolf for helpful suggestions.

REFERENCES

1. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.
2. Fitch,W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99–106.
3. Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.
4. Kawarabayasi,Y., Hino,Y., Horikawa,H., Yamazaki,S., Haikawa,Y., Jin-no,K., Takahashi,M., Sekine,M., Baba,S., Ankai,A. et al. (1999) Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res., 6, 83–101.
5. Natale,D.A., Shankavaram,U.T., Galperin,M.Y., Wolf,Y.I., Aravind,L. and Koonin,E.V. (2000) Genome annotation using clusters of orthologous groups of proteins (COGs) – towards understanding the first genome of a Crenarchaeon. Genome Biol., 1, 0009.1–0009.19.
6. The C.elegans Sequencing Consortium. (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.
7. Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185–2195.
8. Doolittle,W.F. (1998) You are what you eat: a gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes. Trends Genet., 14, 307–311.
9. Natale,D.A., Galperin,M.Y., Tatusov,R.L. and Koonin,E.V. (2000) Using the COG database to improve gene recognition in complete genomes. Genetica, 108, 9–17.
10. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
11. Tatusova,T.A., Karsch-Mizrachi,I. and Ostell,J.A. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.
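
As an illustration of the statistical evaluation of the fit described for the new version of COGNITOR, the following Python sketch computes the per-species fractions f_ij, the probability p_2j of exactly two chance BeTs, and the 'average' value F_j for a single COG. The function names, the bisection solver and the toy protein counts are assumptions made for this example and are not taken from the COGNITOR implementation. The sum is assumed to run over ordered pairs of distinct species, which is one way to read the factor 1/2, and F_j is taken as the smaller root of the equation, which is the relevant one when the fractions are small.

```python
from itertools import permutations
from math import comb, prod


def bet_fractions(n_j, N):
    """f_ij = n_ij / N_i: chance of one BeT from species i into COG j."""
    return [n_ij / N_i for n_ij, N_i in zip(n_j, N)]


def p_two_bets(f):
    """p_2j: probability of exactly two chance BeTs into COG j.

    Assumption: the sum runs over ordered pairs (i, k) of distinct species,
    so the factor 1/2 counts each unordered pair once.
    """
    m = len(f)
    total = 0.0
    for i, k in permutations(range(m), 2):
        others = prod(1.0 - f[l] for l in range(m) if l not in (i, k))
        total += f[i] * f[k] * others
    return 0.5 * total


def average_F(p2, m, tol=1e-12):
    """Solve C(2,m) F^2 (1 - F)^(m-2) = p2 for the smaller root F.

    The left-hand side increases on [0, 2/m], so plain bisection on that
    interval is sufficient (a hypothetical choice of solver).
    """
    def lhs(F):
        return comb(m, 2) * F ** 2 * (1.0 - F) ** (m - 2)

    lo, hi = 0.0, 2.0 / m
    if p2 > lhs(hi):
        raise ValueError("p2 exceeds the maximum of the left-hand side")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lhs(mid) < p2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Toy counts (invented for illustration): proteins of each species in
    # one COG, and the total number of proteins in each species.
    n_j = [2, 1, 3]
    N = [2000, 1000, 3000]
    f = bet_fractions(n_j, N)
    p2 = p_two_bets(f)
    print(p2, average_F(p2, len(f)))
```

When all species have the same fraction, F_j coincides with that fraction, which gives a quick sanity check of the solver.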

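The classification of genomes by co-occurrence in COGs can be sketched in the same spirit. The Python example below encodes each species as a 1/0 vector over COGs, centres the columns, and obtains the first principal component by power iteration on the species-by-species Gram matrix. The species labels, the six-COG toy matrix and the use of power iteration are assumptions chosen to keep the example free of external libraries; this is not the procedure used by the COG web tool, and a real analysis would use all COGs and more than one component.

```python
from math import sqrt


def first_component_scores(presence, n_iter=500):
    """Return {species: score on the first principal component}.

    presence maps a species name to a list of 1/0 values, one per COG.
    """
    names = list(presence)
    rows = [presence[name] for name in names]
    n, d = len(rows), len(rows[0])

    # Centre every COG column on its mean across species.
    means = [sum(row[c] for row in rows) / n for c in range(d)]
    centred = [[row[c] - means[c] for c in range(d)] for row in rows]

    # Gram matrix G = X X^T (species x species); its leading eigenvector,
    # scaled by the square root of its eigenvalue, gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(x, y)) for y in centred] for x in centred]

    # Power iteration from a fixed, deterministic start vector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(g * vi for g, vi in zip(row, v)) for row in gram]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    eigenvalue = sum(
        vi * sum(g * vj for g, vj in zip(row, v)) for vi, row in zip(v, gram)
    )
    scale = sqrt(max(eigenvalue, 0.0))
    return {name: vi * scale for name, vi in zip(names, v)}


if __name__ == "__main__":
    # Hypothetical presence/absence patterns over six COGs.
    toy = {
        "bact1": [1, 1, 1, 0, 0, 1],
        "bact2": [1, 1, 1, 0, 0, 0],
        "arch1": [0, 0, 0, 1, 1, 1],
        "arch2": [0, 0, 0, 1, 1, 0],
    }
    for name, score in first_component_scores(toy).items():
        print(name, round(score, 3))
```

In this toy matrix the two groups of species fall on opposite sides of the first component, analogous to the separation of the primary domains of life reported for the translation, transcription and replication categories. The sign of a principal component is arbitrary, so only the grouping is meaningful.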
Publisher: Oxford University Press
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/29.1.22

Aminoacidtransport and 180 193 222 metabolism Nucleotide transport and 85 88 99 ADDING PROTEINS FROM MULTICELLULAR metabolism EUKARYOTES TO PROKARYOTIC COGs Coenzyme metabolism 86 63 69 Lipid metabolism 52 237 169 The current COG collection includes multiple bacterial and archaeal genomes and only one eukaryotic species, the yeast General function prediction only 344 673 635 Saccharomyces cerevisiae. Incorporating the larger genomes of Function unknown 53 60 84 multicellular eukaryotes into the COG system is a challenging Total 1954 2848 2842 task due to the preponderance of multidomain proteins in these organisms. As a first step toward this goal, we sought to identify eukaryotic proteins that fit into already existing COGs, in other After three distant eukaryotic genomes were included in the words, those eukaryotic proteins that have orthologs in at least prokaryotic COGs, it was of interest to analyze their co-occurrence. two prokaryotic species. To this end, 19 895 protein sequences As expected, the majority of COGs with eukaryotic members from the (nearly) complete genome of the nematode Caenorhab- include all three genomes; at the same time, a considerable ditis elegans (6) and 14 100 sequences from the genome of the number of COGs include all possible pairs of eukaryotic fruit fly Drosophila melanogaster (7) were analyzed using the genomes and each of the individual species (Table 4). These COGNITOR program, which assigns proteins to COGs on the observations, which will be analyzed in detail elsewhere, support basis of multiple genome-specific best hits and splits multi- domain protein into individual domains if these show affinity the major role of lineage-specific gene loss and horizontal gene with different COGs. After manual validation of the results, transfer in eukaryotic evolution. 20% of the D.melanogaster proteins and 14% of the C.elegans proteins were assigned to COGs; a significant number of Table 4. 
Co-occurrence of the eukaryotic genomes in the COGs proteins from each of the multicellular eukaryotes were included in COGs of each functional category, with the notable Numbers of COGs Eukaryotic species exception of ‘Cell division and chromosome partitioning’ and 578 C.elegans D.melanogaster S.cerevisiae ‘Cell motility and secretion’, which consist primarily of 99 C.elegans D.melanogaster prokaryote-specific proteins (Table 3). The COG analysis of the 38 C.elegans S.cerevisiae worm and fly proteins yielded numerous functional predictions, which have not been described previously (I.V.Garkavtsev and 46 C.elegans E.V.Koonin, unpublished observations). Eukaryotic proteins 77 D.melanogaster S cerevisiae that have orthologs in prokaryotes belong to two major catego- 44 D.melanogaster ries: (i) ancient proteins inherited from the last common 166 S.cerevisiae ancestor of all extant life forms or at least the common ancestor Nucleic Acids Research, 2001, Vol. 29, No. 1 25 Table 5. Detection of missed proteins using phylogenetic pattern analysis Species Number of previously undetected COGs including new proteins proteins assigned to COGs A.fulgidus 3 1143, 1255, 1698 M.jannaschii 4 0286, 0827, 1908, 1996 M.thermoautotrophicum 1 2888 P.abyssi 1 2888 P.horikoshii 15 1383, 1761, 1919, 1998, 2004 2051, 2075, 2092, 2093, 2097 2167, 2212, 2260, 2443, 2888 A.pernix 19 0640, 1522, 1605, 1694, 1848 1858, 2002, 2118, 2260, 2443 A.aeolicus 6 0254, 0255, 0690, 0858, 1828 B.subtilis 2 1582, 1863 C.trachomatis 1 1314 D.radiodurans 2 1863, 2120 H.influenzae 1 1826 H.pylori 1 0690 M.tuberculosis 1 0458 M.genitalium 1 0828 M.pneumoniae 3 0816, 0828, 1546 T.maritima 2 0230, 1886 T.pallidum 1 0268 DETECTING MISSED GENES NEW FEATURES ASSOCIATED WITH THE COGS One of the features associated with the COG database is the Improvement of the COGNITOR program—statistical analysis of phylogenetic patterns, i.e. 
the patterns of species evaluation of the fit that are represented or not represented in each of the COGs. The original COGNITOR program uses multiple genome-specific Unexpected phylogenetic patterns, for example, those that best hits (BeTs) as the only criterion for assigning new proteins contain all but one bacterial species or those that include only to COGs. In the new version, we introduced an estimate of the one of a pair of closely related species, may be due to omission probability that the query protein is assigned to the given COG of genes in genome annotations submitted to GenBank or to by chance. Under the assumption of uniform distribution of unusual evolutionary phenomena such as non-orthologous hits to each genome in the COG database, the probability of displacement of a nearly ubiquitous gene. Before considering one BeT into a particular COG is, simply, the fraction of the second hypothesis, the first one should be tested, and we proteins from the specified genome that belongs to the COG: undertook a systematic analysis of COGs with unexpected f =n /N ij ij i phylogenetic patterns in search of missing members (9). The where n is the number of proteins from species i in COG j and N ij i nucleotide sequence of the genome in question was searched is the total number of proteins in species i. Then, the probability of using the TNBLASTN program (10) and the sequences of exactly two BeTs into COG j is given by: members of the respective COGs as queries. As a result, p j=1/2Σf f Π (1 – f ) missing genes coding for members of 48 COGs were identified 2 ij kj lj l# i (Table 5); most of the predicted new proteins are small, which l# k explains why they have escaped the original genome annotations. Thus the COG system is instrumental in improving genome Similar expressions can be easily obtained for a different annotation not only with respect to functional predictions, but also number of BeTs. 
For each COG, we can compute p and find 2j for gene identification per se. the ‘average’ value of Fj that satisfies the equation: 26 Nucleic Acids Research, 2001, Vol. 29, No. 1 Figure 2. An example of a COG-Info page. proteins comprising the COG and the three-dimensional 2 (m–2) C(2,m)Fj (1 – Fj) =p 2j structure of the domains if known or predictable; a succinct where m is the number of species in COG j. Using Fj simplifies summary of the common structural and functional features of the calculation of the probability when the specified number of the COG members and peculiarities of individual members; BeTs is large. key references (Fig. 2). The COG-Info pages are currently at different stages of construction. COG-Info pages Classification of genomes on the basis of co-occurrence in In order to increase the utility of the COG system for genome COGs using principal component analysis annotation, a web page that contains additional structural and functional information on the COG as a whole and individual The data on the co-occurrence of genomes in COGs was used members is now associated with each COG. These hyperlinked as the input for classification by principal component analysis Info pages include: systematic classification of the COG (PCA). Briefly, the presence or absence of a given species in members under the current classification systems for enzyme each COG is converted into a 1/0 coordinate value in a multi- or transporters (if applicable); indications which COG dimensional space where each dimension corresponds to a members (if any) have been characterized genetically and COG, which results in a geometric representation of all included biochemically; information on the domain architecture of the species in the >2000-dimensional space. The PCA analysis is then Nucleic Acids Research, 2001, Vol. 29, No. 1 27 Figure 3. Classification of genome by co-occurrence in COGs using PCA. (A) All COGs. 
(B) Translation, transcription and replication (functional categories J, K and L). (C) Metabolism (functional categories C, E, F, G, H and I). used to choose the subspace of lower dimensionality for visual list of all COGs hyperlinked to individual COG pages; COGs organized by functional category; COGs organized by examination. The eigenvector decomposition yields the functional complexes and pathways; an interactive matrix of orthogonal courses in the space and the corresponding co-occurrence of genomes in COGs; a phylogenetic pattern eigenvectors constitute the spread of the objects. The WWW search tool; a principal component classification tool; interface provides tools for selection of the subspace, the COGNITOR; a COG Help page. Each of the individual COG species to view and the COGs to use for classification pages is hyperlinked to: (i) pictorial representations of BLAST (Fig. 3A). Significantly different results were obtained when search outputs for each member of the COG, which also different functional categories of COGs were analyzed. include links to the respective GenBank and Entrez-Genomes Specifically, the combined categories of translation, transcription entries, (ii) a multiple alignment of the COG members and replication showed a sharp separation between bacteria, produced automatically by using the ClustalW program, (iii) a archaea and eukaryotes, with representatives of each of these COG-Info page (reached by clicking on the COG number). primary domains of life forming a tight cluster (Fig. 3B); the The supplement to the COGs, which shows proteins from metabolic functions produced a more complex picture, with a C.elegans and D.melanogaster assigned to each COG is accessible separation of free-living and parasitic bacteria and grouping of at http://www.ncbi.nlm.nih.gov/COG/euk. The COG data set yeast with the former (Fig. 3C). is also available by anonymous ftp at ftp://ncbi.nlm.nih.gov/ pub/COG. 
Integration of COGs with the Genome Division of Entrez The COGs are now integrated with the Genomes division of the Entrez system. From the COG pages, the proteins are ACKNOWLEDGEMENTS linked to the Entrez genome view (the ‘Genome’ button) and to The authors are grateful to David Lipman for his critical contri- the protein neighbor view (the Blink button). Conversely, the bution at the initial stage of the COG project and constant Genomes division of Entrez (11) incorporates COG information support and inspiration and to Vivek Anantharaman, in several displays. The COG information including the break- L. Aravind, Kira Makarova, Igor Rogozin and Yuri Wolf for down by the functional categories is presented for each helpful suggestions. genome, for example: http://www.ncbi.nlm.nih.gov:80/cgi-bin/ Entrez/coxik?gi=131. The main page for each genome includes REFERENCES a (usually) circular genome map, with radial lines corre- sponding to genes color-coded according to the functional 1. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637. categories adopted in the COG system. Additionally, for all 2. Fitch,W.M. (1970) Distinguishing homologous from analogous proteins. proteins that belong to COGs, the protein view is linked to the Syst. Zool., 19, 99–106. respective COG. 3. Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36. THE COG WORLDWIDE WEB SITE 4. Kawarabayasi,Y., Hino,Y., Horikawa,H., Yamazaki,S., Haikawa,Y., Jin-no,K., Takahashi,M., Sekine,M., Baba,S., Ankai,A. et al. (1999) The COG database is accessible at http://www.ncbi.nlm.nih.gov/ Complete genome sequence of an aerobic hyper-thermophilic COG. The site includes the following main features: complete crenarchaeon, Aeropyrum pernix K1. DNA Res., 6, 83–101. 28 Nucleic Acids Research, 2001, Vol. 29, No. 1 5. 
Natale,D.A., Shankavaram,U.T., Galperin,M.Y., Wolf,Y.I., Aravind,L. 8. Doolittle,W.F. (1998) You are what you eat: a gene transfer ratchet could and Koonin,E.V. (2000) Genome annotation using clusters of orthologous account for bacterial genes in eukaryotic nuclear genomes. Trends Genet., groups of proteins (COGs) – towards understanding the first genome of a 14, 307–311. Crenarchaeon. Genome Biol., 1, 0009.1–0009.19. 9. Natale,D.A., Galperin,M.Y., Tatusov,R.L. and Koonin,E.V. (2000) Using 6. The C.elegans Sequencing Consortium. (1998) Genome sequence of the the COG database to improve gene recognition in complete genomes. nematode C. elegans: a platform for investigating biology. The C.elegans Genetica, 108, 9–17. Sequencing Consortium. Science, 282, 2012–2018. 10. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. 7. Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al. generation of protein database search programs. Nucleic Acids Res., 25, (2000) The genome sequence of Drosophila melanogaster. Science, 287, 3389–3402. 2185–2195. 11. Tatusova,T.A., Karsch-Mizrachi,I. and Ostell,J.A. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.
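The chance-probability estimate described for the new version of COGNITOR can be illustrated with a short Python sketch. It computes the per-genome fraction f_ij = n_ij/N_i, sums the two-BeT probability p_2j over unordered genome pairs (which equals one half of the sum over ordered pairs), and solves C(2,m) F_j^2 (1 - F_j)^(m-2) = p_2j for F_j. The function names, the use of bisection, and the choice of the root in the interval [0, 2/m] are assumptions made for illustration only; the text does not say how F_j is obtained numerically.

```python
from math import comb, prod


def bet_fraction(n_ij: int, N_i: int) -> float:
    """f_ij = n_ij / N_i: fraction of genome i's proteins that belong to COG j."""
    return n_ij / N_i


def prob_exactly_two_bets(f: list[float]) -> float:
    """p_2j for one COG j, given f[i] = f_ij for every genome i.

    Summing over unordered pairs i < k equals one half of the sum
    over ordered pairs i != k.
    """
    m = len(f)
    total = 0.0
    for i in range(m):
        for k in range(i + 1, m):
            others = prod(1.0 - f[l] for l in range(m) if l not in (i, k))
            total += f[i] * f[k] * others
    return total


def average_fraction(p2: float, m: int, iterations: int = 200) -> float:
    """Solve comb(m, 2) * F**2 * (1 - F)**(m - 2) = p2 for F by bisection.

    Assumption: the root in [0, 2/m] is the one wanted; the left-hand
    side increases on that interval and peaks at F = 2/m.
    """
    if m < 2:
        raise ValueError("at least two genomes are needed for two BeTs")

    def lhs(F: float) -> float:
        return comb(m, 2) * F**2 * (1.0 - F) ** (m - 2)

    lo, hi = 0.0, 2.0 / m
    if not 0.0 <= p2 <= lhs(hi):
        raise ValueError("p2 is outside the range attainable with a single F")
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if lhs(mid) < p2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

When every genome has the same fraction, the recovered F_j should coincide with it, which gives a simple sanity check: `average_fraction(prob_exactly_two_bets([0.1] * 5), 5)` should return a value very close to 0.1.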
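The growth-curve procedure (random orders of genome inclusion, followed by the minimum, mean and maximum number of COGs for each number of genomes) can be sketched in Python as follows. The rule used here to decide when a COG has formed, namely that at least `min_genomes` of its member genomes are already included (default three), is a simplifying assumption standing in for the actual COG construction procedure, and the toy membership data are hypothetical.

```python
import random
from statistics import mean


def cog_growth_curve(order, cog_genomes, min_genomes=3):
    """Number of COGs formed after each genome in `order` is added.

    cog_genomes maps a COG id to the set of genomes represented in it.
    Assumption: a COG counts as formed once min_genomes of its genomes
    have been included.
    """
    included = set()
    curve = []
    for genome in order:
        included.add(genome)
        formed = sum(
            1 for members in cog_genomes.values()
            if len(members & included) >= min_genomes
        )
        curve.append(formed)
    return curve


def growth_envelope(genomes, cog_genomes, n_orders=10, seed=0):
    """Minimum, mean and maximum COG counts over random inclusion orders."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_orders):
        order = list(genomes)
        rng.shuffle(order)
        curves.append(cog_growth_curve(order, cog_genomes))
    per_step = list(zip(*curves))  # one tuple of counts per number of genomes
    return (
        [min(step) for step in per_step],
        [mean(step) for step in per_step],
        [max(step) for step in per_step],
    )


# Hypothetical toy data: four genomes, three COGs.
toy_cogs = {
    "COG_a": {"A", "B", "C"},
    "COG_b": {"A", "B", "C", "D"},
    "COG_c": {"B", "C", "D"},
}
lows, means, highs = growth_envelope(["A", "B", "C", "D"], toy_cogs)
```

The three returned lists correspond to the lower boundary, the mean curve and the upper boundary of the shaded area in Figure 1; with only 10 sampled orders they bound the sampled curves rather than every possible inclusion order.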

Journal

Nucleic Acids Research – Oxford University Press

Published: Jan 1, 2001
