BIOINFORMATICS: What is it?
THE DEFINITIONS OF BIOINFORMATICS
- Bioinformatics is an integration of mathematical, statistical and computer methods to analyse biological, biochemical and biophysical data (Georgia Inst. of Tech., USA)
- Bioinformatics is the study of biological information as it passes from its storage site in the genome to the various gene products in the cell. It involves the creation and development of advanced information and computational technologies for problems in molecular biology (Stanford University, USA)
- Bioinformatics specifically refers to the search for and use of patterns and structure in biological data, and to the development of new methods for database access. Computational biology is more frequently used to refer to the physical and mathematical simulation of biological processes (Virginia Inst. Tech., USA)
THE GENOMIC ERA
The genomic era has produced a massive explosion in the amount of biological information
[Figure: Eukaryota 91; Bacteria 1155; Archaea 37]
Why bioinformatics was born
HGP (Human Genome Project): it formally began in 1990 and was planned to last 15 years, but rapid technological advances brought its completion forward to 2003.
www.ncbi.nlm.nih.gov/genbank/genbankstats.html
The genome project stimulated the improvement of sequencing technologies. Costs fell and run times shortened: the amount of data produced exploded.
Why bioinformatics was born
[Figure: Eukaryota 91; Bacteria 1155; Archaea 37]
Who takes care of organizing, analysing and making all these data accessible? BIOINFORMATICS
Computers can be used to gather, store and analyse biological data
Storing, organizing and making biological data accessible
Bioinformatics: DATABASES
- Accurate storage, organization, indexing and maintenance of biological information
- Communication and integration between databases
- Updating of the information
- Examples: protein and nucleotide sequences, genomes, structures, diseases, ...
Storing, organizing and making biological data accessible
Other, so-called high-throughput, sources of biological data:
- Proteomics
- Transcriptomics
- Interactomics
And also: structural genomics, systems biology...
Analysing biological data and building predictive tools
Bioinformatics: TOOLS
- search for similarity between sequences (search for functional homology)
- search for genes in DNA sequences (deciphering)
- search for functional motifs in DNA (e.g. transcription factor binding sites), in RNA (secondary structures) and in proteins (domains)
- analysis of genomes and their comparison
- multiple sequence alignment and phylogenetic analysis
- analysis of 3D structural data of PROTEINS; protein structure prediction
- analysis of the results of microarray experiments
- ...
What is bioinformatics
Bioinformatics is a scientific discipline dedicated to solving biological problems with computational methods.
It is the application of computational methods to the acquisition, management and analysis of biological data.
BIOINFORMATICS: A bit of history
Bioinformatics was born at the end of the 1970s, when the first nucleotide sequences were published
The first databases
- End of the 1970s: NBRF (National Biomedical Research Foundation) database: a collection of clusters of homologous proteins
- 1981: first release of the EMBL (European Molecular Biology Laboratory, Heidelberg) database: a library of DNA and RNA sequences
- 1982: first release of the GenBank (USA) database: a library of DNA and RNA sequences
- 1986: first release of the DDBJ (Japan) database: a library of DNA and RNA sequences
- Mid-1980s: first specialized databases (PROSITE, PDB, ...)
- Nowadays: more than 1200 biological databases are available on the web
The first bioinformatics tools
- 1970: an algorithm to find the best global alignment of two sequences was published
- 1981: an algorithm to find the best local alignment of two sequences was published
- 1983: an algorithm to search for similarities in databases was published
- 1985: FASTA
- 1990: BLAST
FASTA and BLAST are two of the most widely used programs for sequence similarity searches
Nowadays, many algorithms have been published:
1) for sequence comparison
2) to build phylogenetic trees
3) to predict secondary and tertiary protein structures
4) to predict protein-protein and protein-inhibitor interactions
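The 1970 global alignment method listed above is commonly known as the Needleman-Wunsch algorithm. As a rough illustration of the idea, the sketch below computes only the optimal global alignment score by dynamic programming. The function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (Needleman-Wunsch style).

    Scoring values are illustrative assumptions; real tools use
    substitution matrices and more refined gap penalties.
    """
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

Only two rows of the table are kept in memory, which is enough for the score; recovering the alignment itself would require the full table and a traceback step.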
BIOINFORMATICS: Databases
Databases
- collect information and data derived from the literature and from laboratory analyses, or from the application of bioinformatic (in silico) analyses.
- are generally freely accessible and can be consulted via the web.
Data archives
Before talking about databases, we introduce data archives.
Large amounts of information and data create the need to store and preserve them.
Before the advent of the PC, information was stored on physical media such as paper.
How? By setting up dedicated structures for keeping the data.
Example: registers, notebooks
Limitations: data are stored sequentially, which does not allow a specific ordering
A more advanced system: the card file
Its main element is the card, which characterizes each item in the archive
Bioinformatics in the post-sequence era Nature Genetics 33, 305-310 (2003) M. Kanehisa & P. Bork
General types of databases
- Primary: raw, unprocessed data
- Secondary: curated data selected according to criteria (e.g. non-redundancy, fold)
- Tertiary: processed data (e.g. HMM profiles)
Nucleotide sequence databases
These are the containers of all the experimental data produced worldwide and made available to the scientific community. For example, gene sequence databases contain these data together with general related information (laboratory where the sequencing took place, date, species, description).
- EMBL data library: Europe
- GenBank: USA
- DDBJ: Japan
The three databases are updated daily by exchanging the data each one received during the day, so querying just one of the three is sufficient.
Protein sequence databases
These libraries collect both protein sequences obtained through experimental determination and protein sequences derived from the translation of nucleotide sequences (which were predicted or determined to code for a protein)
Sources:
- Nucleotide sequence databases (GenBank, DDBJ, EMBL database)
- Amino acid sequences determined through experimental analysis
The data are validated and enriched with specific information
Protein sequence databases: PIR, TrEMBL, SWISSPROT
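To make "translation of nucleotide sequences" concrete, here is a minimal sketch that translates a coding DNA sequence into a protein sequence. It assumes the standard genetic code, reading frame 1, and that translation stops at the first stop codon; the function name `translate` and the use of `X` for unrecognized codons are illustrative choices, not part of any database pipeline described in the slides.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# (assumption: other genetic codes, e.g. mitochondrial, are ignored).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence in frame 1 up to the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

A trailing partial codon is ignored, so sequences whose length is not a multiple of three still translate.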
Before 2002
PIR (Protein Information Resource) http://pir.georgetown.edu
- Developed by Georgetown University (USA) and the MIPS (Munich)
- Good annotation quality and a good update level, but isolated from the other resources
SWISSPROT
- Developed by the Swiss Institute of Bioinformatics (SIB) (Switzerland)
- A high level of annotation, a minimal level of redundancy and a high level of integration with other databases
TrEMBL (Translated EMBL Nucleotide Sequence Data Library) http://www.expasy.org/swissprot/
- Developed by the EBI (UK)
- A computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot
After 2002
The groups that curated PIR, SWISSPROT (SIB) and TrEMBL (EBI) joined forces to create the UNIPROT Knowledgebase
The Universal Protein resource http://www.expasy.uniprot.org/
The central database of protein sequences with accurate, consistent and rich sequence and functional annotation
The UNIPROT database structure
TrEMBL entries move into the Swiss-Prot protein knowledgebase after manual revision
Swiss-Prot FEATURES:
- As much annotation information as possible (annotations are updated periodically, and external experts on specific groups of proteins contribute their comments)
- Minimal redundancy (many databases contain more than one entry for a given protein, corresponding to different literature reports; in Swiss-Prot the data for a protein are merged into one entry)
- Integration with other databases (integration with other nucleic acid, protein and structure databases is guaranteed: links to more than 50 different databases)
- Documentation which explains the database
TrEMBL FEATURES:
- Automatic annotation (information is transferred from the well characterized sequences in Swiss-Prot to these entries)
- Redundancy removal (sequences of the same organism with 100% identity are merged)
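The redundancy removal rule (same organism, 100% identity) can be sketched as a simple grouping step. This is only an illustration under stated assumptions: entries are hypothetical `(entry_id, organism, sequence)` tuples, and "100% identity" is simplified to exact string equality over the full length. It is not the actual procedure used by the database curators.

```python
def merge_identical(entries):
    """Group entries that share organism and an identical sequence.

    entries: iterable of (entry_id, organism, sequence) tuples
             (a hypothetical record layout used only for this sketch).
    Returns a list of (entry_ids, organism, sequence), one per merged group,
    in order of first appearance.
    """
    groups = {}
    for entry_id, organism, sequence in entries:
        key = (organism, sequence.upper())  # case-insensitive comparison
        groups.setdefault(key, []).append(entry_id)
    return [(ids, organism, sequence) for (organism, sequence), ids in groups.items()]


if __name__ == "__main__":
    example = [
        ("E1", "Homo sapiens", "MKTAYIAK"),
        ("E2", "Homo sapiens", "MKTAYIAK"),  # duplicate of E1, same organism
        ("E3", "Mus musculus", "MKTAYIAK"),  # same sequence, other organism
    ]
    for ids, organism, sequence in merge_identical(example):
        print(ids, organism, sequence)
```

Identical sequences from different organisms stay separate, matching the "same organism" condition in the rule.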
Genome databases
A genome sequence database collects data related to the genomic mapping and the genomic sequencing of one or more organisms.
Three kinds of genome databases:
- Databases which collect data on all the sequenced genomes, e.g. Entrez_Genomes (NCBI), EBI_Genome (EBI)
- Databases which collect data on a category of organisms with sequenced genomes, e.g. bacterial genomes (Comprehensive Microbial Resource at the TIGR database)
- Databases specific to one organism with a sequenced genome, e.g. GadFly or FlyBASE (fruit fly), MGD (mouse), RGD (rat), GDB and Ensembl (human)
Useful addresses:
Database        Genome(s)                     Address
Entrez_genome   All sequenced genomes         http://www.ncbi.nlm.nih.gov/
EBI_Genomes     All sequenced genomes         http://www.ebi.ac.uk/genomes/
TIGR            Sequenced bacterial genomes   http://www.tigr.org/
GadFly          Fruit fly                     http://www.gadfly.org/
FlyBASE         Fruit fly                     http://www.flybase.org/
MGD             Mouse                         http://www.informatics.jax.org/
RGD             Rat                           http://rgd.mcw.edu/
Ensembl         Human                         http://www.ensembl.org/
GDB             Human                         http://gdbwww.gdb.org/
The common features of genomic databases
- The possibility to download all the (nucleic acid) sequences of the genome, or a part of them (e.g. a specific chromosome)
- Most genomic databases have also developed the corresponding proteomic resource (the complete set of proteins obtained by translating the genome of a given organism)
Structure databases
The PDB is the single worldwide repository for the processing and distribution of 3-D structure data of large biological molecules.
The PDB is managed by the Research Collaboratory for Structural Bioinformatics (RCSB) consortium:
- Rutgers, the State University of New Jersey
- San Diego Supercomputer Center (SDSC)
Accession numbers: 4 characters, e.g. 1fbl, 12ca
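A small check of the 4-character accession format can be sketched as follows. The slide only states that accession numbers have 4 characters; the extra constraint used here (a leading digit from 1 to 9 followed by three letters or digits) reflects the classic PDB identifier convention and should be treated as an assumption. The function name `is_pdb_id` is hypothetical.

```python
import re

# Assumption: classic PDB IDs are a digit 1-9 followed by three
# alphanumeric characters; the slide itself only says "4 characters".
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def is_pdb_id(text):
    """Return True if text looks like a classic 4-character PDB accession."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["1fbl", "12ca", "abcd", "1fb"]:
        print(candidate, is_pdb_id(candidate))
```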
PDB Statistics
68562 structures deposited
PDB redundancy
Since any newly solved structure can be deposited in the PDB, more than one structure can be available for the same protein (different resolution, different spectroscopic techniques)
68562 structures deposited
Sequence identity   Number of nonredundant chains
90%                 28176
50%                 23708
30%                 19241
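The idea behind a nonredundant set at a given sequence identity cutoff can be sketched with a greedy filter: a chain is kept only if it is less similar than the cutoff to every chain already kept. This is a simplified illustration, not the method used to produce the numbers above. It assumes the sequences are already aligned to equal length, and the names `percent_identity` and `nonredundant` are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def nonredundant(chains, threshold):
    """Greedy filter: keep a chain only if its identity to every
    previously kept chain is below threshold (in percent).

    chains: list of (name, aligned_sequence) pairs.
    """
    kept = []
    for name, sequence in chains:
        if all(percent_identity(sequence, kept_seq) < threshold
               for _, kept_seq in kept):
            kept.append((name, sequence))
    return kept


if __name__ == "__main__":
    chains = [
        ("a", "ACDEFGHIKL"),
        ("b", "ACDEFGHIKV"),  # 9 of 10 positions identical to "a"
        ("c", "WWWWWWWWWW"),
    ]
    print([name for name, _ in nonredundant(chains, 90)])
```

The result of a greedy filter depends on the input order; lowering the cutoff can only shrink or preserve the kept set for a fixed order, which is consistent with the decreasing counts in the table.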
PDB related sources
MSD      http://www.ebi.ac.uk/msd/
PDBsum   http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
MMDB     http://www.ncbi.nlm.nih.gov/
How many databases are there?
Every year, Nucleic Acids Research (NAR) publishes an issue dedicated to databases and maintains an archive of all the databases known to it
[Figure: 2006 data]
In 2006 there were about 860
In 2010 there are 1230
http://www.oxfordjournals.org/nar/database/a/
Secondary protein databases
They contain the results of analyses performed on the sequences held in primary databases, in order to enrich the data with useful information.
Example: from SWISSPROT, a primary database of amino acid sequences, the secondary databases Prosite and Pfam were derived; these place greater emphasis on the classification of protein families and domains.
- Prosite: http://au.expasy.org/prosite/ Database of protein families and domains
- Pfam: http://www.sanger.ac.uk/software/pfam/ Protein families database of alignments and HMMs