Research projects. Research associations *. Infrastructure projects. 7-04-2014

Doing scientific research: context and guidelines
Nicoletta Dessì
8/4/2014

The operating context
- Research
- Research associations *

Research projects
They are the instruments for funding research.
- European: very complex and difficult. They involve partners from 2 or 3 countries.
- National: PRIN (base unit + local units, annual); FIRB (proposed by young researchers).
- Local: University (CAR, annual, reserved for active researchers); Regione Sardegna (annual).
* Example for computer science

Research projects
They set out a detailed plan of the research to be carried out by one or more units (which include PhD students), coordinated by a project leader. They have a financial plan that may provide for the acquisition of:
- Equipment
- Personnel (fixed-term researchers, research fellows, scholarship holders, PhD students, contracts)
- Reimbursement of expenses for travel, conferences, etc.

Infrastructure projects
They are the instruments for funding the acquisition of equipment or the implementation of specific services. They do not include scholarships or research fellowships, only contracts. They aim not at scientific output but at achieving an objective.

Research projects
The Department (and, one level up, the University by MIUR) is evaluated and also funded on the basis of the research funds it has been able to acquire. Projects are judged by at least two qualified anonymous reviewers, sometimes non-Italian ones.

How research is conducted (1)
Step by step, with small advances over what has been done so far in the field (state of the art). New ideas almost always come from reading existing work. Keep continuously up to date on the topics. In computer science, for example, a paper from two years earlier may already be out of date, unless it is a milestone of the field.

How research is conducted (2)
ACQUIRING THE STATE OF THE ART
It means having a frame of reference for how research in a field is progressing. Survey papers frame the problem and save time. It serves to assess how much and how to invest in a research topic (identifying open problems). Keep track of what you have learned, because it is useful for future publications.

How research is conducted (3)
At least 3 components are needed:
- The basic idea, innovative with respect to the state of the art
- The verification or prototype implementation (feasibility of the idea)
- The "sale" of the idea (publication of the results)

The scientific-disciplinary areas
01 - Mathematics and Computer Science
02 - Physical Sciences
03 - Chemical Sciences
04 - Earth Sciences
05 - Biological Sciences
06 - Medical Sciences
07 - Agricultural and Veterinary Sciences
08 - Civil Engineering and Architecture
09 - Industrial and Information Engineering
10 - Sciences of Antiquity, Philology, Literature and Art History
11 - Historical, Philosophical, Pedagogical and Psychological Sciences
12 - Legal Sciences
13 - Economics and Statistics
14 - Political and Social Sciences

Classification of the areas with respect to research
Non-bibliometric areas: 08, 10, 11, 12, 13, 14
Bibliometric areas: 01, 02, 03, 04, 05, 06, 07, 09

Bibliometric areas
01 - Mathematics and Computer Science
02 - Physical Sciences
03 - Chemical Sciences
04 - Earth Sciences
05 - Biological Sciences
06 - Medical Sciences
07 - Agricultural and Veterinary Sciences
09 - Industrial and Information Engineering

Non-bibliometric areas
08 - Civil Engineering and Architecture
10 - Sciences of Antiquity, Philology, Literature and Art History
11 - Historical, Philosophical, Pedagogical and Psychological Sciences
12 - Legal Sciences
13 - Economics and Statistics
14 - Political and Social Sciences

Diversification of scientific output (what counts): non-bibliometric areas
a) Number of books (with ISBN)
b) Number of journal articles and book chapters (with ISBN)
c) Number of articles in class A journals
http://www.anvur.org/index.php?option=com_content&view=article&id=254&itemid=315&lang=it

Diversification of scientific output (what counts): non-bibliometric areas
- Scientific output is not subject to expert review (except for journals).
- The author of a monograph is often also its publisher.
- The level of dissemination of the scientific product is not evaluated.
Publications PREDOMINANTLY in ITALIAN

Diversification of scientific output (what counts): bibliometric areas
- Number of articles in journals indexed in the main international databases (ISI and SCOPUS)
- Total number of citations received, relative to the overall scientific output and to academic age
- H-index (contemporary Hirsch index): an author has H-index N if N of their publications have each received at least N citations.

Diversification of scientific output (what counts): bibliometric areas
- Scientific output is subject to expert review (peer review).
- The editor and the reviewers can reject the publication.
- The level of dissemination of the scientific product is evaluated.
Publications EXCLUSIVELY in ENGLISH

Types of publication - Conference paper
From 6 to 14 pages (the limit is set by the conference call; journals have no limit). It comprises:
- TITLE
- AUTHORS (in alphabetical order for Mathematics and Computer Science, in order of importance for Biology and Medicine: principal investigator etc.)
- ABSTRACT
- Keywords

Types of publication - Conference paper
A conference paper follows a specific process:
- Call for Papers and Important Dates (distributed electronically)
- Paper submission (electronically, e.g. EasyChair)
- Peer review
- Notification of the reviewers' decision (acceptance/rejection)
- Incorporating the changes suggested by the reviewers
- Conference registration
- Presentation (15 minutes, in English)
- Publication of the Proceedings
Make sure that the conference is indexed in ISI/Scopus.

Conference paper
In Mathematics, Physics, Chemistry and some Biology fields, papers presented at conferences have negligible importance. In Computer Science, some conferences are considered as important as journal papers, especially those published in book series (e.g. LNCS, Lecture Notes in Computer Science) or conferences that have been held for many years (VLDB, DEXA, etc.). Workshops on specific topics are held within conferences. Make sure that workshop papers are published in the conference Proceedings and not separately.

Journal article
An article that presents a complete piece of work, sometimes an extended version of a paper presented at a conference.
- Paper submission
- Notification of the reviewers' decision (acceptance with changes/rejection); this can take 1 year or more.
- Insertion of the changes and new review
- Proofs and publication
Make sure that the journal is in ISI/Scopus and assess its importance (e.g. SCImago quartile or field-specific lists).

Types of publication - BOOKS (research topics, with ISBN)
- Written by a single author (rare; monographs)
- Collection of articles by different authors, assembled by one or more editors

H-index
Example.
Author X has H-index 3 if at least 3 of their publications are each cited at least 3 times. There is the problem of how citations are counted. Normalization is done with respect to academic age (counted from the first publication). Citations are an evaluation parameter in competitive selection procedures.
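The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `h_index` and the citation counts for "author X" are made up for the example, and real databases (ISI, Scopus) may count citations differently.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for author X (illustrative data only).
author_x = [10, 4, 3, 1, 0]
print(h_index(author_x))  # prints 3
```

Sorting first keeps the check simple: once a paper's citation count falls below its rank, no later paper can raise the index.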

Structure of a publication (1)
A publication has the following standard structure:
- Title and authors with affiliation
- Abstract (short, summarizing the work)
- Introduction
- Related Work
- Section 1.
- Section 2.
- Conclusions / Future work
- Acknowledgements
- References

Title, Authors, Abstract: example
BioCloud Search EnGene: Surfing Biological Data on the Cloud
Nicoletta Dessì, Emanuele Pascariello, Gabriele Milia, Barbara Pes
Università degli Studi di Cagliari, Dipartimento di Matematica e Informatica, Via Ospedale 72, 09124 Cagliari, Italy
dessi@unica.it, emanuele.pascariello@gmail.com, milia.ga@unica.it, pes@unica.it
Abstract. The massive production and spread of biomedical data around the web introduces new challenges related to identify computational approaches for providing quality search and browsing of web resources. This paper presents BioCloud Search EnGene (BSE), a cloud application that facilitates searching.
Keywords: Biomedical data exploration, Cloud computing, Data searching, Data integration, Dataspaces, Pay-as-you-go data querying.

Structure of a publication (2) - Introduction
It frames the work: what has been done previously, the motivations and innovative aspects of the work being presented, and how it differs from earlier work. It ends with a very brief summary of how the paper is structured.

1 Introduction
The massive production and spread of biomedical data around the web introduces new challenges related to identify computational approaches for their management and exploitation. These challenges mainly result from three issues:
- Biomedical data are typical of the category of big data [1]. The term big data refers to the ever increasing amount of information that organizations are storing, processing and analyzing, owing to the growing number of information sources in use [2].

End of the introduction:
The paper is organized as follows.
Section 2 provides background concepts and motivates the adoption of dataspace and cloud paradigms. Section 3 details the architectural aspects of BSE. The system functionalities are described in section 4. Finally, section 5 presents conclusions.

References

- Comprehensive review of semantic similarity measures.
- Suggestions concerning the best uses of semantic similarity measures tailored to different contexts.
- Assessment with biological features.
- Critical discussion of common issues.
- Outline of future direction of research.

1. Cannataro M, Guzzi PH, Veltri P. Protein Interaction Data: technologies, databases and algorithms. ACM Comput Sur 2010;43:1-36.
2. Baclawski K, Niu T. Ontologies for Bioinformatics (Computational Molecular Biology). Cambridge, MA: The MIT Press, 2005.

Maurizio Atzori, University of Cagliari, e-mail: atzori@unica.it
Nicoletta Dessì, University of Cagliari, e-mail: dessi@unica.it
1 The work of Dr. Atzori has been done within the project Unstructured Data Integration for Dataspaces (U-DID) funded by RAS PO Sardegna FSE 2007-2013 L.R.7/2007

BRIEFINGS IN BIOINFORMATICS
Submitted: 5th August 2011; Received (in revised form): 30th September 2011
Corresponding author. Pietro H. Guzzi, Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Viale Europa (Loc. Germaneto), 88100 Catanzaro, Italy. E-mail: hguzzi@unicz.it, hguzzi@gmail.com
*These authors contributed equally to this work

Pietro H. Guzzi is an Assistant Professor of Computer Engineering at the University Magna Græcia of Catanzaro, Italy, since 2008. He received his PhD in Biomedical Engineering in 2008, from Magna Græcia University of Catanzaro. He received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. His research interests comprise bioinformatics, the analysis of proteomics data, and the analysis of protein interaction networks. Pietro is an ACM member and serves the scientific community as reviewer for many conferences. He is associate editor of Information Science journal, and of SIGBioinformatics Record.

Marco Mina is a Ph.D. student at the Department of Information Engineering, University of Padova, Italy, since 2010.
He received the bachelor degree and the master degree in Computer Science and Engineering from the University of Padova, Italy, in 2009 and 2007, respectively. His research interests comprise bioinformatics, in particular the analysis of protein interaction networks and the integration of heterogeneous data.

Concettina Guerra is a professor at the Department of Information Engineering of the University of Padova, Italy, and at the College of Computing of the Georgia Institute of Technology, Atlanta, GA, USA. Her research activity is in the areas of Computational Biology, Bioinformatics and Computer Vision. Her recent interests fall in the domains of protein classification, recognition and docking and of comparative analysis of biological networks. She has been on the faculty of the University of Rome, Italy and of Purdue University, USA, for over a decade. She has visited extensively with US Institutions, including Rensselaer Polytechnic and Carnegie Mellon University. Dr Guerra is a founding member of the steering committee of the International Symposium on 3D Data Processing Visualization and Transmission, that she co-chaired in 2002. She was Co-Director of the CIME School on Mathematical Methods for Protein Structure Analysis and Design (2000) and chairman of the fifth IEEE International Workshop on Computer Architectures for Machine Perception (2000), general chairman of the 10th International Conference on Research in Computational Molecular Biology, RECOMB06 and Co-Director of the series of Lipari Schools in Bioinformatics and Computational Biology.

Mario Cannataro is Associate Professor of Computer Engineering at the Magna Græcia University of Catanzaro, Department of Medical and Surgical Sciences, and an Associate Researcher at ICAR-CNR, Italy. He worked on parallel computing, massively parallel architectures, parallel implementation of logic programs and cellular automata.
His current research explores bioinformatics, computational proteomics and genomics, medical informatics, grid and parallel computing and adaptive web systems. Dr Cannataro has published three books and more than 150 papers in international journals and conference proceedings. He is a Senior Member of ACM and a member of IEEE Computer Society and BITS (Italian Bioinformatics Society). Dr. Cannataro is a co-founder and a member of Exeura (www.exeura.com) and EasyAnalysis (www.easyanalysis.it).

© The Author 2011. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com

3. Harris MA, Clark J, Ireland A, et al. The gene ontology (go) database and informatics resource. Nucleic Acids Res 2004;32:258-61.
4. du Plessis L, Škunca N, Dessimoz C. The what, where, how and why of gene ontology, a primer for bioinformaticians. Brief Bioinform 2011; doi: 10.1093/bib/bbr002.
5. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37:1-13.
6. Pesquita C, Faria D, Falcao AO, et al. Semantic similarity in biomedical ontologies. PLoS Comput Biol 2009;5:e1000443.
7. Pesquita C, Pessoa D, Faria D, et al. CESSM: Collaborative Evaluation of Semantic Similarity Measures, JB2009: Challenges in Bioinformatics 2009.
8. Wang J, Zhou X, Zhu J, et al. Revealing and avoiding bias in semantic similarity scores for protein pairs. BMC Bioinformatics 2010;11:290.
9. Ali W, Deane CM. Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 2009;25:3166-73.
10. Cho Y-R, Hwang W, Ramanathan M, et al. Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 2007;8:265.
11. Popescu M, Keller JM, Mitchell JA. Fuzzy measures on the Gene Ontology for gene product similarity. IEEE/ACM Trans Comput Biol Bioinform 2006;3:263-74.
12. Martin D, Brun C, Remy E, et al. GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004;5:R101.
13. Benabderrahmane S, Smail-Tabbone M, Poch O, et al. IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinformatics 2010;1:588.
14. Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 2007;8:R183.
15. Mistry M, Pavlidis P. Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics 2008;9:327.
16. Al-Mubaid H, Nagar A. Comparison of four similarity measures based on GO annotations for Gene Clustering. Report no. 3, 2008 IEEE Symposium on Computers and Communications, 6-9 July 2008. Morocco: Marrakech.
17. Pesquita C, Faria D, Bastos H, et al. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 2008;9(Suppl 5):S4.
18. Gentleman A. Visualizing GO Distances Using Bioconductor. http://bioconductor.org/packages/2.3/bioc/html/gostats.html (10 October 2011, date last accessed).
19. Ye P, Peyser BD, Pan X, et al. Gene function prediction from congruent synthetic lethal interactions in yeast. Mol Syst Biol 2005;1:2005.0026.
20. Sheehan B, Quigley A, Gaudin B, et al. A relation based measure of semantic similarity for Gene Ontology annotations. BMC Bioinformatics 2008;9:468.
21. Lee HK, Hsu AK, Sajdak J, et al. Coexpression analysis of human genes across many microarray data sets. Genome Res 2004;14:1085.
doi:10.1093/bib/bbr066

Package | Functions | Measures | Input data
csbl.go [60] | SS measures, Clustering based on SS | Resnik, Lin, JiangConrath, GRaSM, simrel, Kappa Statistics, Cosine, Weighted Jaccard, Czekanowski-Dice | Genes and Proteins annotations
GOSemSim [61] | SS measures | Resnik, Lin, Jiang, simrel, G-SESAME | GO Terms
GOvis [62] | SS measures | simlp, simui | Entrez gene IDs, Gene ontology

Web server | Functions | Measures | URL
FuSSiMeG [47] | SS measures, statistical tests | Resnik, Lin, JiangConrath, GraSM | http://xldb.fc.ul.pt/biotools/rebil/ssm/
ProteInOn [17] | SS measures, search for assigned GO Terms and annotated proteins, representative of GO Terms | Resnik, Lin, JiangConrath, simgic, GraSM, simui | xldb.di.fc.ul.pt/tools/proteinon/
FunSimMat [63] | SS measures, disease-related genes prioritization | simrel, Lin, Resnik, JiangConrath | http://funsimmat.bioinf.mpi-inf.mpg.de/
GOToolBox [12] | SS measures, clustering | Si, Sp, SCD | http://genome.crg.es/gotoolbox/
G-SESAME [25] | SS measures, clustering | G-SESAME | http://bioinformatics.clemson.edu/g-sesame
None of these tools requires input annotations or GOs.

Authors in ALPHABETICAL order
Dataspaces: where structure and schema meet
Maurizio Atzori and Nicoletta Dessì
Abstract. In this chapter we investigate the crucial problem that poses the bases to the concept of dataspaces: the need for human interaction/intervention in the process of organizing (getting the structure of) unstructured data.
We survey the existing techniques behind dataspaces to overcome that need, exploring the structure of a dataspace along three dimensions: dataspace profiling, querying and searching and application domain. We will further explore existing projects focusing on dataspaces, induction of data structure from documents, and data models where data schema and documents structure overlaps will be reviewed, such as Apache Hadoop, Cassandra on Amazon Dynamo, Google BigTable model and other DHT-based flexible data structures, Google Fusion Tables, imemex, U-DID, WebTables and Yahoo! SearchMonkey.

1 Introduction
Data integration has emerged over the last few years as a challenge to improving search in vast collections of structured data that yield heterogeneity at scale unseen before. Current information systems and IT infrastructures are mainly based on the exchange of strongly-structured data and on well-established standards (database, XML files and other known data formats). Nevertheless, enterprise and personal data handled everyday are mostly unstructured (estimates range from 80 to 95%), i.e., their contents do not follow

Structure of a publication (3)
- Sections. The various sections that describe, point by point, the work carried out.
- Conclusions / Future work. They draw the conclusions and outline possible future developments.
- Acknowledgements
- References. Extended bibliography, cited by number inside the paper (e.g. [1]).
The whole paper is formatted according to the publisher's requirements.

Author order NOT alphabetical
Example from PubMed
Briefings in Bioinformatics Advance Access published December 2, 2011
Semantic similarity analysis of protein data: assessment with biological features and issues
Pietro H. Guzzi*, Marco Mina*, Concettina Guerra and Mario Cannataro
Abstract
The integration of proteomics data with biological knowledge is a recent trend in bioinformatics.
A lot of biological information is available and is spread on different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as framework to mine annotated data. Recently many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is currently becoming more and more central in research. Existing approaches span from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after the definition of main concept of such analysis, presents a systematic discussion and comparison of main approaches. Finally, remaining challenges, as well as possible future directions of research are presented.
Keywords: Semantic similarity measures; protein data; biological features
Downloaded from http://bib.oxfordjournals.org/ by guest on March 6, 2014

similarity. However, most of groupwise approaches do not take into account term specificity and behave poorly. SimGIC is the only groupwise measure competing with pairwise approaches. Actually, Resnik is one of the most considered semantic similarity measures, always included in assessment works and behaving properly most of the times. More recent approaches based on term specificity such as G-SESAME, simgic, simic and TCSS seem to outperform Resnik in several cases, but with the exception of simgic they have not been included in many assessment or comparison works. Anyhow, we believe they represent the next generation of semantic similarity measures that should be used.
All of them offer improvements over Resnik in different directions, resolving some of the issues presented above.

TOOLS AND APPLICATIONS FOR THE SEMANTIC ANALYSIS
This section presents some existing tools implementing SS measures. The current scenario is characterized by the absence of a tool that implements all the SS measures or that is easily extendible. Considering the distribution, tools are mainly available as web servers (Table 6) or as packages for the R platform (Table 5). However, FuSSiMeG, ProteInOn, FunSimMat, csbl.go and SemSim together cover almost all the similarity measures. In general, tools are based on GO and annotation corpora. Some tools, such as the web servers, include their own copy of annotation corpora and GO, offering user-friendly and ready-to-go solutions. However, they rely on maintainers for updated data, and generally do not offer many possibilities of customization or extension. On the contrary, other tools such as stand-alone R-packages, are generally more flexible and often easily extendable, but they require the intervention of expert users. Usually they require the user to provide annotations and ontologies as input data in more or less common formats. While this enables the full control over data used and guarantees the possibility to use most-updated data, the preparation of input datasets may result in an error-prone waste of time. A possible future direction may regard the development of a comprehensive platform for the integrated semantic analysis of protein interaction networks.

Table 5: Packages for R
Table 6: Web servers for calculation of semantic similarity measures

CONCLUSIONS
SS measures, i.e. the quantification of the similarity of two or more terms belonging to the same ontology, is a well established field. The application of SS to proteins as well as to protein interaction data is still a novel field, and there exist many open problems and challenges that should be addressed.
In this work, we presented a survey of main SS measures based on GO and the main issues discussed in the scientific community regarding: (i) the assessment of SSs in terms of biological features and (ii) the biases on the calculation of SSs that arise in the biological field. The several assessments reported in this work provide a clear vision of the extent to which SS measures correlate with other biological features and similarity measures. Furthermore, we identified some critical points and issues regarding current measures that may stimulate discussion and research in the future. We concluded that Resnik, one of the most considered SS measures, behaves properly most of the times. More recent approaches based on term specificity such as G-SESAME, simgic, simic and TCSS seem to outperform Resnik in several cases. We believe they represent the next generation of SS measures that should be used, since all of them offer improvements over Resnik in different directions, resolving some of the issues presented above.

Finally, we point the attention to another problem that is emerging. Recently, semantic similarity measures have been used as input or validation data in several genome-wide and proteome-wide applications (i.e. PPI networks alignment problems), requiring the computation of semantic similarity between whole proteomes. Considering as an example the yeast organism, containing more than 5000 proteins, these applications require the calculation of more than 25 million protein similarities. So far, there is only one freely available tool, GS2 [64], that efficiently generates proteome-wide SS scores. Further work is necessary to design faster solutions for the calculation of semantic similarity measures.

SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Key Points

Supplementary material

Another scheme, used in Bio-Med and in any case for the experimental sciences:
- Introduction
- Methods
- Results
- Discussion
- References

Received on September 20, 2005; revised on January 16, 2006; accepted on February 3, 2006
Advance Access publication February 21, 2006
Associate Editor: Chris Stoeckert

ABSTRACT
Motivation: Pathway modeling requires the integration of multiple data including prior knowledge. In this study, we quantitatively assess the application of Gene Ontology (GO)-derived similarity measures for the characterization of direct and indirect interactions within human regulatory pathways. The characterization would help the integration of prior pathway knowledge for the modeling.
Results: Our analysis indicates information content-based measures outperform graph structure-based measures for stratifying protein interactions. Measures in terms of GO biological process and molecular function annotations can be used alone or together for the validation of protein interactions involved in the pathways. However, GO cellular component-derived measures may not have the ability to separate true positives from noise. Furthermore, we demonstrate that the functional similarity of proteins within known regulatory pathways decays rapidly as the path length between two proteins increases. Several logistic regression models are built to estimate the confidence of both direct and indirect interactions within a pathway, which may be used to score putative pathways inferred from a scaffold of molecular interactions.
Contact: s.guo@wriwindber.org

The function of a biological system relies on a combinatory effect of many semantic elements, which interact non-linearly. We need to take a global view of the entire biological network, at many levels of abstraction, to manage complex biological states such as disease. Biological pathways and networks are built upon the identification of protein interactions. Traditionally, information about protein-protein interactions is collected from small-scale screening. The accuracy of each interaction is often validated with multiple experiments.
With the development of high-throughput methods such as the two-hybrid assay and protein chip technology, the information within interaction databases has increased tremendously (Drewes and Bouwmeester, 2003). In addition, a number of computational methods have been developed for the prediction of protein-protein interactions based on protein structure and/or genomic information (Valencia and Pazos, 2002). The increased coverage of the protein-protein interaction map provides deeper insight into the global properties of the interaction networks. However, interaction data derived from large-scale assays and computational methods are often very noisy. Thus, it is essential to develop strategies to validate putative protein interactions such that pathways can be rebuilt from a scaffold of reliable molecular interactions (Chen and Xu, 2003).

To whom correspondence should be addressed.
Vol. 22 no. 8 2006, pages 967-973, doi:10.1093/bioinformatics/btl042

Various genomic features exist in sequence, structure, functional annotation and expression-level databases which may be used for interaction prediction and validation (Valencia and Pazos, 2002). Recently, Lu et al. (2005) have evaluated the predictive power of 16 features, ranging from coexpression relationships to similar phylogenetic profiles. Among those features, semantic similarity between two proteins has the dominant performance in discriminating true interactions from noise. The maximum predictive power is approached by integrating only a few features including the functional similarity of protein pairs.

Semantic similarity is traditionally assessed as a function of the shared annotation of proteins in a controlled vocabulary system, such as Gene Ontology (GO) (Sprinzak et al., 2003). GO terms and their relationships are represented in the form of directed acyclic graphs (DAGs). The ontology provides computationally accessible semantics about the gene functions they describe.
GO comprises three categories: molecular function (MF), biological process (BP) and cellular component (CC). MF describes activities at the molecular level, and a BP is accomplished by one or more assemblies of MF (Ashburner et al., 2000). Although interacting proteins often participate in the same BP, they are less likely to have the same MF. Jansen et al. calculate the similarity of a protein pair by identifying the set of GO terms shared by the two sets of protein annotations (2003). Their method can only use annotations derived from BP subontology, but not MF subontology. In addition, even though two annotations are different, they can be closely related via their common ancestors in DAG. Traditional methods also fail to take into account the specificity of GO terms. Although some proteins share the same GO terms, these terms may be too general to verify the functional association of the annotated proteins.

There are two strategies that can be used to overcome these limitations. The first strategy is based on the graph structure of GO. For each protein we may obtain an induced graph which includes the specific set of GO annotations for the protein and all parents of those GO terms. The similarity between two induced graphs can then be used to estimate the similarity between two proteins (Gentleman, 2005, http://www.bioconductor.org/repository/devel/vignette/govis.pdf). The second strategy is based on the assumption that the more information two terms

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

0.5. A realistic classification method must have an AUC larger than 0.5. Curves from different cross-validation runs are averaged by sampling at fixed thresholds, and standard deviations are used to visualize the variability across the runs (Fawcett, 2003). We use the ROC and ROCR libraries in R to draw the graph and calculate the AUCs (Sing et al., 2004).
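To make the AUC threshold of 0.5 concrete, here is a minimal Python sketch. It is not the ROCR code used in the excerpt: it assumes the standard rank interpretation of the AUC (the probability that a randomly chosen true interaction scores higher than a randomly chosen false one, with ties counted as one half), and the function name and the scores are invented for illustration.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical similarity scores for true and false interactions.
true_pairs = [0.8, 0.4]
false_pairs = [0.6, 0.2]
print(auc(true_pairs, false_pairs))  # prints 0.75
```

A score that carries no information ranks positives above negatives about half of the time, which is why 0.5 is the baseline a useful classifier has to beat.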
Multiple logistic regression is effective when the response variable is dichotomous and the input variables are continuous, categorical or dichotomous. It is a commonly used model for the prediction of true protein-protein interactions (Bader et al., 2004; Lin et al., 2005). The form of the model is

$$\log\frac{p}{1-p} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k \qquad (4)$$

where p is the probability that a putative interaction is true and X_1, X_2, ..., X_k are independent variables such as semantic similarity measures. Logistic regression thus forms a predictor variable log[p/(1 - p)] which is a linear combination of the explanatory variables. The values of this predictor variable are then transformed into probabilities by a logistic function. We use the glm function in R to perform the logistic regression. A likelihood ratio test is applied to see if a model including a given independent variable provides more information than a model without this variable. The generalization error and performance of each logistic regression model are estimated by 10-fold cross-validation and ROC curve analysis.

Experimentally determined human protein-protein interactions have been collected in the Biomolecular Interaction Network Database (BIND) (Bader et al., 2003). Interaction data in BIND are organized into low-throughput (LTP) and high-throughput (HTP) sections based on the number of records in the same publication. HTP data are imported from papers that have more than 40 interaction results arising from the same experimental design and methodology. Examples include those derived from exhaustive two-hybrid hybridizations, immunoprecipitations and microarray methods. LTP interactions are manually curated from papers with fewer than 40 interaction results identified by the same method. They include not only data identified by traditional small-scale screening, but also data from the two-hybrid assay and other newer approaches.
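As a reading aid for Equation (4), the following Python sketch shows how a linear predictor is mapped to a probability by the logistic function. It is not the R glm call used in the study; the function name, the intercept and the coefficient values are hypothetical and chosen only for illustration.

```python
import math


def interaction_probability(intercept, coefficients, features):
    """Map the linear predictor b_0 + sum(b_i * X_i) to a probability p.

    Inverts log(p / (1 - p)) = linear predictor, i.e. applies the
    logistic function 1 / (1 + exp(-predictor)).
    """
    predictor = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-predictor))


# Hypothetical coefficients for BP-, MF- and CC-derived similarity scores.
b0 = -2.0
b = [1.5, 0.8, 0.2]
pair_scores = [2.0, 1.0, 0.5]
p_true = interaction_probability(b0, b, pair_scores)
is_predicted_interaction = p_true > 0.5  # 0.5 threshold, as used for the model
```

A predictor of zero corresponds to p = 0.5, which is why 0.5 is a natural decision threshold for the fitted model.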
Recently, an approach based on evolutionary cross-species comparisons has emerged for the completion of protein interaction maps (Matthews et al., 2001). Human protein-protein interactions may be predicted from lower eukaryotic protein interaction maps through the identification of orthologous genes between different species (Lehner and Fraser, 2004; Brown and Jurisica, 2005). We compare the reliability of the three human protein interaction datasets using Resnik measures. Experimental datasets (LTP and HTP) are downloaded from BIND, and the orthology-inferred dataset (Ortho) is from the core dataset computed by Lehner and Fraser. The reliability of each dataset is estimated by the fraction of interactions with scores above the defined threshold over all protein-protein interactions with corresponding measures available. For BP, MF and CC-derived measures, a different threshold is chosen to achieve maximum accuracy in discriminating true and false interactions for our training dataset described in Section 2.2. The accuracy is the weighted average of true positive and true negative rates. For the logistic regression model, 0.5 is used as the threshold.

KEGG Markup Language (KGML) facilitates computational analysis and modeling of protein pathways and networks (Kanehisa et al., 2004). Currently, there are approximately 30 human regulatory pathways with KGML files available. For each pathway, we calculate the semantic similarity values for proteins within the same complex, neighboring proteins and protein pairs at different distances in the pathway. Neighboring pairs represent proteins that directly interact with each other, while distant pairs represent proteins that interact indirectly through various numbers of bridge proteins. The distance between two proteins is defined as the length of their shortest path in the pathway. Mean similarity values are calculated for each category of protein pairs.
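The pathway distance defined above (the length of the shortest path between two proteins) can be sketched with a breadth-first search. This Python example assumes the pathway is already available as a list of undirected protein-protein edges; the edge-list format, the function name and the toy protein labels are assumptions and are not part of KGML or of the authors' code.

```python
from collections import deque


def pathway_distance(edges, source, target):
    """Shortest-path length between two proteins, or None if unconnected."""
    # Build an undirected adjacency map (assumption: direction is ignored).
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy linear pathway A - B - C - D plus an unrelated pair X - Y.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
neighbor_distance = pathway_distance(toy_edges, "A", "B")  # direct interaction
remote_distance = pathway_distance(toy_edges, "A", "D")    # two bridge proteins
```

With this definition, neighboring pairs have distance 1 and a pair separated by k bridge proteins has distance k + 1.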
A permutation test is used to see how often random chance would generate a mean similarity at least as high as the observed value. For each category, the same number of random pairs is picked from all proteins in the pathways, and the mean similarity value is calculated and compared with the original mean similarity. This process is repeated 1000 times, and the P-value is defined as the frequency with which the random dataset generates a mean similarity value equal to or higher than the original value. In addition, the mean similarity (y) is fitted against the distance (x) with an exponential distribution such that the rate of decay may be estimated by the mean life of the distribution.

Under the information content-based strategy, the more information two terms share, the more similar they are. The shared information is indicated by the information content of the terms that subsume them in the DAG. The information content is defined as the frequency of each term, or any of its children, occurring in an annotated dataset. Less frequently occurring terms are said to be more informative. Given the information content of each term, several measures may be calculated to estimate the semantic similarity between annotated proteins (Lord et al., 2003b). Recently, both approaches have been applied in the analysis of the protein interactome (Brown and Jurisica, 2005; Chen and Xu, 2004). However, a systematic evaluation of their performance remains to be done.

Given the large amount of protein interaction data, we can build a comprehensive scaffold of interactions. One popular paradigm for cellular modeling involves rebuilding pathways from this scaffold. The mining usually uses global data pertaining to molecular and cellular states such as gene expression profiles and protein post-translational modifications. The active subnetworks extracted from the large interaction scaffold may represent concrete hypotheses as to the underlying mechanisms governing the observed state change (Ideker and Lauffenburger, 2003).
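The permutation test for mean similarity described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the `similarity` callback, the fixed random seed and the toy protein list are all hypothetical.

```python
import random


def permutation_p_value(observed_mean, n_pairs, all_proteins, similarity,
                        n_rounds=1000, seed=0):
    """Fraction of random rounds whose mean similarity >= the observed mean."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    hits = 0
    for _ in range(n_rounds):
        total = 0.0
        for _ in range(n_pairs):
            # Draw two distinct proteins at random from the pathway proteins.
            a, b = rng.sample(all_proteins, 2)
            total += similarity(a, b)
        if total / n_pairs >= observed_mean:
            hits += 1
    return hits / n_rounds


# Toy setup: a constant similarity function stands in for a GO-based measure.
proteins = ["P1", "P2", "P3", "P4"]


def flat_similarity(a, b):
    return 0.2


p_high = permutation_p_value(0.9, 5, proteins, flat_similarity)
p_low = permutation_p_value(0.1, 5, proteins, flat_similarity)
```

A small P-value indicates that random pairs rarely reach the observed mean similarity, which is the sense in which a category of protein pairs is "more similar than expected by chance".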
However, the noisy nature of both high-throughput interactions and state measurements makes pathway modeling extremely difficult. The integration of prior pathway knowledge would increase the reliability of newly inferred pathways. KEGG (Kyoto Encyclopedia of Genes and Genomes) includes current knowledge on molecular interaction networks such as pathways and complexes (Kanehisa et al., 2004). Characterization of KEGG pathways may help us to develop new methods for pathway modeling.

In this study, we quantitatively assess the application of GO-based similarity methods in human protein-protein interaction and pathway analysis. First, receiver operating characteristic (ROC) analysis is used to assess the ability of GO graph structure and information content-based methods to stratify protein interactions. For each method, there are three measures in terms of BP, MF or CC annotations. We investigate whether the three measures can be integrated by logistic regression to improve performance. Based on the logistic regression model, we then estimate the reliability of several protein-protein interaction datasets. More importantly, we characterize the semantic similarity of proteins within human regulatory pathways. Several logistic regression models are built to validate indirect protein interactions in a pathway. These models may be used to infer or rank putative pathways given the scaffold of protein interactions.

Graph similarity-based measures are estimated using the GOstats package of Bioconductor (Gentleman, 2005). Each protein is associated with an induced graph that is obtained by taking the most specific GO terms annotated to the protein and by finding all parents of those terms until the root node is reached. Two methods, union-intersection (UI) and longest shared path (LP), are used to calculate the between-graph similarity. The first method, UI, uses the number of nodes two induced graphs share divided by the total number of nodes in the two graphs.
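The union-intersection score just described can be sketched in a few lines of Python. The sketch reads "the total number of nodes in the two graphs" as the size of the union of the two node sets, which is an assumption suggested by the method's name; it is not the GOstats implementation, and the `parents` dictionary, function names and toy term labels are hypothetical.

```python
def induced_graph(terms, parents):
    """Annotated terms plus every ancestor reachable through `parents`."""
    nodes = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in nodes:
            nodes.add(term)
            stack.extend(parents.get(term, ()))
    return nodes


def ui_similarity(terms_a, terms_b, parents):
    """Shared nodes divided by all nodes of the two induced graphs."""
    graph_a = induced_graph(terms_a, parents)
    graph_b = induced_graph(terms_b, parents)
    union = graph_a | graph_b
    return len(graph_a & graph_b) / len(union) if union else 0.0


# Toy DAG: c and d are children of b; b and e are children of the root.
toy_parents = {"c": ["b"], "d": ["b"], "b": ["root"], "e": ["root"]}
siblings = ui_similarity(["c"], ["d"], toy_parents)   # share b and root
unrelated = ui_similarity(["c"], ["e"], toy_parents)  # share only root
```

In the toy DAG, the sibling terms share two of four nodes, while the unrelated terms share only the root.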
The resulting UI similarity values are bounded between 0 and 1, with more similar proteins having values near 1. The second method, LP, adopts the depth of the longest path shared by two induced graphs as the similarity score. The larger the depth, the more similar two proteins are. If two proteins are both quite specific and similar, they should have a long shared path and thus a high similarity score.

Information content-based measures are implemented using a locally installed GO database. We use the associations between GO terms and UniProt-Human (Bairoch et al., 2005) proteins to calculate the information content p(t), which is the frequency of each GO term or any child term occurring within the corpus. Both is-a and part-of links are used to define the child term. Given the information content, we apply three measures to calculate the semantic similarity between terms. The first measure (Resnik) is solely based on the information content of shared parents of the two terms. If there is more than one shared parent, the minimum information content is taken. Then the similarity score is derived as shown in Equation (1).

$$\mathrm{sim}(t_1, t_2) = -\ln \min_{t \in S(t_1, t_2)} \{p(t)\} \qquad (1)$$

where S(t1, t2) is the set of parent terms shared by t1 and t2 (Resnik, 1999). Two other measures use not only the information content of the shared parents, but also that of the query terms. Given query terms t1 and t2, Lin's similarity is defined as

$$\mathrm{sim}(t_1, t_2) = \frac{2 \ln \min_{t \in S(t_1, t_2)} \{p(t)\}}{\ln p(t_1) + \ln p(t_2)} \qquad (2)$$

where p(t1), p(t2) and p(t) are information content values for t1, t2 and their parents, respectively (Lin, 1998). Lin's method generates normalized similarity values between 0 and 1. In contrast, Jiang's method uses the same components for the calculation, but generates a semantic distance which can vary between 0 and infinity (Jiang and Conrath, 1997).

$$\mathrm{sim}(t_1, t_2) = 2 \ln \min_{t \in S(t_1, t_2)} \{p(t)\} - \ln p(t_1) - \ln p(t_2) \qquad (3)$$

Given those measures, the semantic similarity between two proteins could be derived accordingly.
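The three term-level measures can be compared side by side in a short Python sketch. It is a minimal illustration under stated assumptions: the term frequencies p(t) and the ancestor sets are toy values, the function names are hypothetical, and each term is counted among its own subsumers (an assumption, so that comparing a term with itself gives a Lin similarity of 1 and a Jiang distance of 0).

```python
import math


def min_shared_p(t1, t2, ancestors, p):
    """Smallest p(t) over the terms subsuming both t1 and t2."""
    shared = (ancestors[t1] | {t1}) & (ancestors[t2] | {t2})
    return min(p[t] for t in shared)


def resnik(t1, t2, ancestors, p):
    # Similarity: -ln of the minimum shared frequency.
    return -math.log(min_shared_p(t1, t2, ancestors, p))


def lin(t1, t2, ancestors, p):
    # Normalized similarity; assumes p < 1 for the query terms,
    # otherwise the denominator would be zero.
    shared = math.log(min_shared_p(t1, t2, ancestors, p))
    return 2.0 * shared / (math.log(p[t1]) + math.log(p[t2]))


def jiang(t1, t2, ancestors, p):
    # Semantic distance (0 means identical information).
    shared = math.log(min_shared_p(t1, t2, ancestors, p))
    return 2.0 * shared - math.log(p[t1]) - math.log(p[t2])


# Toy corpus frequencies and ancestor sets (hypothetical values).
toy_p = {"root": 1.0, "b": 0.5, "c": 0.1, "d": 0.2}
toy_ancestors = {"root": set(), "b": {"root"},
                 "c": {"b", "root"}, "d": {"b", "root"}}

scores = (resnik("c", "d", toy_ancestors, toy_p),
          lin("c", "d", toy_ancestors, toy_p),
          jiang("c", "d", toy_ancestors, toy_p))
```

Note the different orientations: Resnik and Lin are similarities (larger means more similar), whereas Jiang is a distance (smaller means more similar).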
If a protein is annotated with several GO terms, the maximum similarity between all terms is taken as the between-protein similarity. All five methods (UI, LP, Resnik, Lin and Jiang) are based on the April 2005 release of the GO database. The mappings from Gene IDs to GO IDs can be restricted based on evidence codes. We drop those annotations inferred from physical interaction (IPI) to avoid circular reference. In addition, the annotations associated with BP unknown (GO:0000004), MF unknown (GO:0005554) and CC unknown (GO:0008372) are eliminated from our analysis.

These five methods are assessed for their ability to stratify human protein-protein interactions. Each method generates three sets of similarity values corresponding to the BP, MF and CC categories of GO. The positive dataset is assembled from KEGG. It comprises pairwise interactions among proteins of the same complex and interactions of neighboring proteins within human regulatory pathways. After discarding proteins with indirect interaction effects, the interaction types of neighboring proteins include activation, inhibition, binding/association, dissociation, state change, phosphorylation, dephosphorylation, glycosylation, ubiquitination and methylation. For the negative dataset, we randomly choose two distinct human proteins from the Entrez Gene database as a non-interacting protein pair. This is valid since the chance of identifying protein-protein interactions at random is very small (0.024% based on the two-hybrid data by Uetz et al., 2000).

An ROC curve depicts the relative trade-offs between sensitivity and specificity of a given method for different values of the threshold. Sensitivity is defined as the ability to identify a true positive in a dataset. Specificity is defined as the ability to identify a true negative in a dataset. The area under an ROC curve (AUC) is generally used as a measure of the performance.
It denotes the probability that the classification method will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Random guessing generates the diagonal line y = x, which has an AUC of 0.5.

Ashburner, M. et al. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25-29.
Bader, G.D. et al. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res., 31, 248-250.
Bader, J.S. et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22, 78-85.
Bairoch, A. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154-D159.
Brown, K.R. and Jurisica, I. (2005) Online predicted human interaction database. Bioinformatics, 21, 2076-2082.
Chen, Y. and Xu, D. (2003) Computational analyses of high-throughput protein-protein interaction data. Curr. Protein Pept. Sci., 4, 159-181.
Chen, Y. and Xu, D. (2004) Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. Nucleic Acids Res., 32, 6414-6424.
Deane, C.M. et al. (2002) Protein interactions: two methods for the assessment of the reliability of high-throughput observations. Mol. Cell Proteomics, 1, 349-356.
Drewes, G. and Bouwmeester, T. (2003) Global approaches to protein-protein interactions. Curr. Opin. Cell Biol., 15, 199-205.

BIOINFORMATICS ORIGINAL PAPER
Systems biology
Assessing semantic similarity measures for the characterization of human regulatory pathways
Xiang Guo 1, Rongxiang Liu 2, Craig D. Shriver 3, Hai Hu 1 and Michael N. Liebman 1
1 Windber Research Institute, Windber, PA 15963, USA, 2 GlaxoSmithKline Pharmaceutical R&D, King of Prussia, PA 19420, USA and 3 Walter Reed Army Medical Center, Washington, DC 20307, USA
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on March 6, 2014

1 INTRODUCTION
2 METHODS
2.1 Estimation of semantic similarity
2.2 ROC curve analysis
2.3 Logistic regression
2.4 Reliability estimation
2.5 Regulatory pathway analysis
3 RESULTS
3.1 Performance of semantic similarity measures for stratifying protein-protein interactions

We assemble proteins within a complex or neighboring each other in KEGG regulatory pathways as the positive protein-protein interaction dataset (total number 1649). Among them, there are 1500 protein pairs with BP annotations, 1425 pairs with MF annotations and 1255 pairs with CC annotations available for both proteins. The negative dataset with the same number of protein pairs is built by randomly choosing human proteins from Entrez Gene. As shown by the ROC curve analysis, similarity measures based on BP annotation have the highest ability to stratify protein-protein interactions (Figs 1 and 2). MF-derived measures follow, and CC-derived measures have the worst discriminating power. Since GO associations with evidence code TAS (Traceable Author Statement) are regarded as the most accurate, we investigate whether the performance can be improved by restricting GO annotations to TAS only. Interestingly, no significant improvement is achieved, and fewer protein pairs have similarity values available. While the information on subcellular localizations can be used to define robust negative controls for protein interactions, our analysis indicates that localization-based similarity measures may not have the ability to separate true protein interactions from noise.
The reason may be twofold. In contrast to the existence of over 9000 BP terms and over 7000 MF terms, the total number of CC terms is only around 1600. This subontology is much less complete and specific than the MF and BP subontologies, so it may not be expressive enough to validate protein-protein interactions. The other possible reason is related to the bias in link type usage among the different subontologies. GO terms are placed within a structure of relationships with the link type is-a between parent and children as well as the type part-of between part and whole. Generally, only the is-a links are considered for similarity measures (Resnik, 1999), but the omission of the part-of links would result in orphan terms which make the semantic comparison impossible. Our similarity measures treat the two link types equally, which may not be optimal. The ratio of part-of links to is-a links is 17% in the BP category and there are only 2 part-of links in the MF category, but the ratio increases to 70% in the CC category. The high percentage of part-of relationships may make the CC-derived measurement less accurate than the other measures.

In all three GO categories, the information theoretic methods consistently perform better than graph structure-based methods (Fig. 2). Among the five methods, UI has the worst performance

... expected by chance in terms of BP. In contrast, similarity values of remote protein pairs are not different from those of random pairs in terms of MF and CC. As we know, a pathway comprises a series of different functional steps. Neighboring proteins perform one functional step, while distant proteins may play different functional roles in different cellular locations. Our results are consistent with the pathway biology.
In addition, CC-derived similarity values decrease in a stepwise pattern, since two or three sequential functional steps are likely to occur in the same cellular compartment. The distance-dependent similarity fits an exponential decay model. The rate of decay is characterized by the mean life, which is the distance needed for the similarity to be reduced by a factor of e. BP, MF and CC-derived similarity values decay rapidly with mean lives of 1.51, 2.42 and 0.81, respectively.

Our study has shown that the logistic regression model can be used to separate directly interacting proteins from random protein pairs (Fig. 3). The reliability of a putative interaction may be estimated by this model. Similarly, indirectly interacting proteins within a putative pathway may also be validated based on their semantic similarity. Following the same procedure, we have created three models using BP and MF-derived measures to assign confidence scores to protein pairs with a distance of 2, 3 or 4 in a pathway. The 10-fold cross-validation shows that the prediction errors of these models are 26.9, 30.5 and 33.5%. The three models have AUC estimates of 0.82 ± 0.03, 0.79 ± 0.06 and 0.77 ± 0.06, respectively. These models may be used together to validate putative pathways by scoring both direct and indirect interactions in the pathway.

4 DISCUSSION

Although various functional similarity measures have been used in interactome analysis, a systematic evaluation of their performance has not been reported. Our results demonstrate that information content-based measures perform better than GO structure-based measures for the validation of protein interactions involved in human regulatory pathways. Among them, Resnik's approach seems to have the best performance. Measures in terms of either MF or BP can be used to stratify protein interactions. However, CC-derived measures may not be sensitive enough for this purpose.
The application of semantic similarity measures relies on the completeness and accuracy of GO annotation. Most of the proteins included in KEGG pathways have accurate and detailed annotation. However, there may be a considerable number of incorrectly annotated or under-annotated proteins in other databases. The performance of semantic similarity measures may decrease when they are applied to a poorly annotated dataset. For example, if two proteins are annotated only by the non-specific term signal transducer activity (GO:0004871), Lin similarity will be 1 and Jiang distance will be 0, while the UI, LP and Resnik measures generate low similarity scores. Therefore, in the case of under-annotation, the Lin and Jiang measures are more likely to generate false positives, while more false negatives may be seen with the other three measures. As the use of GO improves, the performance of those measures should improve when applied to experimental datasets.

Brown and Jurisica (2005) have recently adopted an information content-based method to validate their protein interaction datasets. However, their method does not separate the three GO categories. The semantic similarity is determined by the maximum similarity from the set of all GO term pairs between interacting proteins. Our results show that BP-based measures produce higher similarity values than MF and CC-based measures (Fig. 4). If there are BP annotations available for a protein pair, then the similarity value derived from the method of Brown and Jurisica is most likely equal to our BP-based similarity value. Currently, BP annotation is the most comprehensive among the three GO categories. In our dataset, if an MF-based measure is defined for a protein pair, there is a 93% chance that a BP-based measure is also defined. Thus, information included in the MF annotation still remains largely unexplored by the method of Brown and Jurisica.
Our results demonstrate that MF-derived measures can be used alone or integrated with BP-derived measures for interactome analysis.

Our KEGG pathway analysis indicates that protein pairs with short path length have significantly higher semantic similarity values than expected by chance alone. These protein pairs can be separated from random protein pairs by logistic regression models. Current pathway modeling methods score candidate subnetworks based on various evidence, including semantic similarity estimates for each protein interaction (Sharan et al., 2005). However, information about proteins that interact indirectly through other bridge proteins has not been utilized for pathway modeling. We propose to calculate confidence scores not only for direct interactions but also for indirect interactions for the validation of putative pathways. The logistic regression model is our first step in this direction. Future work may include integration of more genomic features such as mRNA coexpression, and the development of a probabilistic model to score the candidate subnetworks based on the confidence values assigned to different protein pairs. We believe that new methods incorporating the semantic similarity of proteins that interact directly and indirectly will greatly aid the extraction of active pathways and thus improve the interpretation of intriguing biological phenomena.

ACKNOWLEDGEMENTS

We thank Dr Chen Yu of Monsanto Company for stimulating discussions and Nicholas Jacob, President of Windber Research Institute, for continuing support.

Conflict of Interest: none declared.

REFERENCES