Logic-based information integration

Documenti analoghi
Constant Propagation. A More Complex Semilattice A Nondistributive Framework

Finite Model Theory / Descriptive Complexity: bin

Fiori di campo. Conoscere, riconoscere e osservare tutte le specie di fiori selvatici più note

Graphs: Cycles. Tecniche di Programmazione A.A. 2012/2013

Single-rate three-color marker (srtcm)

How to register for exam sessions ( Appelli ) Version updated on 17/10/2018

College Algebra. Logarithms: Denitions and Domains. Dr. Nguyen November 9, Department of Mathematics UK

Testi del Syllabus. Docente CAGNONI STEFANO Matricola:

Accesso Mul*plo - modelli

WEB OF SCIENCE. COVERAGE: multidisciplinary TIME RANGE: DOCUMENT TYPES: articles, proceedings papers, books

MOMIS: Mediator. nment for Multiple Information. Sources. DB unimo. Laura Po, Antonio Sala

Azioni e proposte SG7/SG8

A.A. 2006/2007 Laurea di Ingegneria Informatica. Fondamenti di C++ Horstmann Capitolo 3: Oggetti Revisione Prof. M. Angelaccio

LA SACRA BIBBIA: OSSIA L'ANTICO E IL NUOVO TESTAMENTO VERSIONE RIVEDUTA BY GIOVANNI LUZZI

UNIVERSITÀ DEGLI STUDI DI TORINO

Database support Prerequisites Architecture Driver features Setup Stored procedures Where to use. Contents

Resources and Tools for Bibliographic Research. Search & Find Using Library Catalogues

How to register online for exams (Appelli) Version updated on 23/10/2017

Fondamenti di Teoria delle Basi di Dati

Online Resource 12: Dataset

Un introduzione al Progetto SOCS: formalizzazione e verifica di protocolli di comunicazione.

Technical Guidelines GON % Italian production. sports car oriented

Enel App Store - Installation Manual - Mobile

LA CLASSIFICAZIONE DELLE OPERAZIONI IN PARTENARIATO PUBBLICO PRIVATO NELL AMBITO DELLE REGOLE EUROPEE

LA SACRA BIBBIA: OSSIA L'ANTICO E IL NUOVO TESTAMENTO VERSIONE RIVEDUTA BY GIOVANNI LUZZI

INTERNET & MARKETING INNOVATIVE COMMUNICATION.

Scheduling. Scheduler. Class 1 Class 2 Class 3 Class 4. Scheduler. Class 1 Class 2 Class 3 Class 4. Scheduler. Class 1 Class 2 Class 3 Class 4

sottobasi per valvole a spola

StrumenJ semanjci per la ricerca applicata ai tram funzionali: sviluppo e applicabilità dei Thesauri

Ammissibilità di co.co.co. e AdR in H2020. Laura Fulci Dirigente Area Ricerca Politecnico di Torino

The new Guglielmo s project comes from the aim to offer

6.5 RNA Secondary Structure

LA SACRA BIBBIA: OSSIA L'ANTICO E IL NUOVO TESTAMENTO VERSIONE RIVEDUTA BY GIOVANNI LUZZI

Sosteniamo le ambizioni del middle market

LA SACRA BIBBIA: OSSIA L'ANTICO E IL NUOVO TESTAMENTO VERSIONE RIVEDUTA BY GIOVANNI LUZZI

SRT064 BTH SRT051 BTH SRT052 BTH

A Solar Energy Storage Pilot Power Plant

Politecnico di Torino. PRAISE & CABLE projects interaction

Estendere Lean e Operational Excellence a tutta la Supply Chain

CORSO VM6SKO: VMware vsphere: Skills for Operators [V6] CEGEKA Education corsi di formazione professionale

General info on using shopping carts with Ingenico epayments

LA SACRA BIBBIA: OSSIA L'ANTICO E IL NUOVO TESTAMENTO VERSIONE RIVEDUTA BY GIOVANNI LUZZI

CORSO MOC6439: Configuring and Troubleshooting Windows Server 2008 Application Infrastructure

Discrete Parabolic Anderson Model with Heavy Tailed Potential

AVERE 30 ANNI E VIVERE CON LA MAMMA BIBLIOTECA BIETTI ITALIAN EDITION

Le piccole cose che fanno dimagrire: Tutte le mosse vincenti per perdere peso senza dieta (Italian Edition)

GstarCAD 2010 Features

Il concetto di varietà e la nascita della teoria degli insiemi RIEMANN ( )

Introduzione alla storia dell intelligenza artificiale e della robotica

6.5 RNA Secondary Structure. 18 novembre 2014

MINISTERO DELL'ISTRUZIONE DELL'UNIVERSITÀ E DELLA RICERCA

19 touchscreen display

Maps. a.k.a, associative array, map, or dictionary

I CAMBIAMENTI PROTOTESTO-METATESTO, UN MODELLO CON ESEMPI BASATI SULLA TRADUZIONE DELLA BIBBIA (ITALIAN EDITION) BY BRUNO OSIMO

100 consigli per vivere bene (Italian Edition)

Gocce d'anima (Italian Edition)

Corso di Laurea in FISICA Dispositivi di calcolo II

IL GIOVANE HOLDEN FRANNY E ZOOEY NOVE RACCONTI ALZATE LARCHITRAVE CARPENTIERI E SEYMOUR INTRODUZIONE BY JD SALINGER

zpcr in practice Fabio Massimo Ottaviani EPV

Self-Calibration Hands-on CASA introduction

AVVISO n Febbraio 2013 Idem. Mittente del comunicato : Borsa Italiana. Societa' oggetto dell'avviso

Si faccia riferimento all Allegato A - OPS 2016, problema ricorrente REGOLE E DEDUZIONI, pagina 2.

Selection procedure. MASTER IN BUSINESS ADMINISTRATION - MBA (LM-77, 2-year postgraduate degree) AY 2017/18

Il Piccolo Principe siamo noi: Adattamento teatrale per la scuola primaria (ABW. Antoine de Saint- Exupery) (Volume 1) (Italian Edition)

Infrastrutture critiche e cloud: una convergenza possibile

CORSO MOC10231: Designing a Microsoft SharePoint 2010 Infrastructure. CEGEKA Education corsi di formazione professionale

Access to HPC resources in Italy and Europe

6.6 ARIMA Model. Autoregressive Integrated Moving Average (ARIMA ) Model. ARIMA(p, d, q) Process

MOOD-TC. a Multilingual Ontology Driven Text Classifier

Introduzione all ambiente di sviluppo

Get Instant Access to ebook D Lgs PDF at Our Huge Library D LGS PDF. ==> Download: D LGS PDF

Syllabus Attività Formativa

E-Business Consulting S.r.l.

Copyright 2012 Binary System srl Piacenza ITALIA Via Coppalati, 6 P.IVA info@binarysystem.eu

La competenza informativa (Information Literacy) nella. Stefania Puccini & Ornella Russo SDIAF - Firenze 26 novembre 2015

Una Ricerca Erboristica (Italian Edition)

Semi di contemplazione (L'educazione interiore) (Italian Edition)

Introduction. The Structure of a Compiler

NATIONAL SPORT SCHOOL

Pubblicazioni COBIT 5

IP TV and Internet TV

Ingegneria del Software

Testi del Syllabus. L insegnamento è composto dai seguenti moduli: ORGANIZZAZIONE AZIENDALE 7 CFU I FONDAMENTALI DI CONTROLLO DI GESTIONE 7 CFU

Downloading and Installing Software Socio TIS

LA SACRA BIBBIA: OSSIA L'ANTICO E IL NUOVO TESTAMENTO VERSIONE RIVEDUTA BY GIOVANNI LUZZI

Keep calm, observe and assess

R.M.Dorizzi, Strumenti per interpretare gli esami di laboratorio? Forlì, 9 ottobre 2012

Seminario Scientifico Innovazione tecnologica per la mobilità ferroviaria Giovedì 6 giugno 2013 ore 9.00

NOTICE. Palladium Securities 1 S.A. Series 112 EUR 100,000,000 Fixed to Floating Rate Instruments due 2023 (with EUR

S C.F.

Get Instant Access to ebook Guida Agli Etf PDF at Our Huge Library GUIDA AGLI ETF PDF. ==> Download: GUIDA AGLI ETF PDF

MOC10982 Supporting and Troubleshooting Windows 10

PROGETTAZIONE ROBUSTA

Appendice E - Appendix E PANNELLI FOTOVOLTAICI - PHOTOVOLTAIC PANELS

SCHEDA INSEGNAMENTO A.A. 2017/2018

Workshop Energy Data Reggio Emilia, 21 febbraio 2017

Transcript:

Logic-based information integration Giuseppe De Giacomo & Riccardo Rosati Dipartimento di Informatica e Sistemistica Antonio Ruberti Università di Roma La Sapienza Joint work with Andrea Calì, Diego Calvanese, Maurizio Lenzerini, Domenico Lembo, Moshe Y. Vardi (Rice Univ.), Guido Vetere (IBM Italia) Course at ESSLLI 05 Edinburgh, UK August 2005

Logic-based information integration: overview Introduction to information (data) integration De Giacomo Query answering in GAV and LAV information integration systems De Giacomo Information integration under constraints: basic techniques Rosati Information integration under constraints: results Rosati Inconsistency tolerance in information integration Rosati (see also Bertossi s ESSLLI 05 course) De Giacomo & Rosati ESSLLI 05 Logic-based information integration 1

Lecture 1: outline Introduction to information (data) integration Data integration: logical formalization De Giacomo & Rosati ESSLLI 05 Logic-based information integration 2

Information integration Answer(Q) Query Global schema Sources De Giacomo & Rosati ESSLLI 05 Logic-based information integration 3

Information integration Query Global schema R 1 C 1 D 1 T 1 Mapping c 1 d 1 t 1 c 2 d 2 t 2 Source schema Source schema Source 1 Source 2 De Giacomo & Rosati ESSLLI 05 Logic-based information integration 4

An example author Paper Researcher cites Selfcitation(x)Ã z, y. cite(x,y) author(z,x) author(z,y) Selfcitation Selfcitation: contains papers that cite (other) papers by the same authors De Giacomo & Rosati ESSLLI 05 Logic-based information integration 5

IT hype The current trend in IT industry is operating in on-demand environments. Operating on-demand is based on three key elements: Integration: Integration creates the necessary business flexibility to optimize operations across and beyond the enterprise. Automation: Automation reduces the complexity and cost of IT management and improves the availability and the resilience. Virtualization: Virtualization provides a single consolidated view of (and easy access to) all available resources, which improves working capital and asset utilization. De Giacomo & Rosati ESSLLI 05 Logic-based information integration 6

Information integration Information integration is the problem of providing unified and transparent access to a set of autonomous and heterogeneous sources Growing market One of the major challenges for the future of IT At least two contexts Intra-organization information integration (e.g., EIS) Inter-organization information integration (e.g. integration on the Web) De Giacomo & Rosati ESSLLI 05 Logic-based information integration 7

The two contexts Intra-organization information integration Data warehousing vs information on demand Known and accessible data sources Global schema (ontology) Mapping between sources and global schema Inter-organization information integration Autonomous conceptualization P2P resource sharing Sharing of ontologies (or, mappings between ontologies) Needs integration infrastructures De Giacomo & Rosati ESSLLI 05 Logic-based information integration 8

Information integration: available industrial technologies Distributed database systems Information on demand Tools for source wrapping Tools based on database federation, e.g., DB2 Information Integrator Distributed query optimization De Giacomo & Rosati ESSLLI 05 Logic-based information integration 9

Integrated access to distributed data Different approaches/architectures: distributed databases data sources are homogeneous databases under the control of the distributed database management system multidatabase or federated databases data sources are autonomous, heterogeneous databases; procedural specification information (data) integration access through a global schema mapped to autonomous and heterogeneous data sources; declarative specification peer-to-peer data integration network of autonomous systems mapped one to each other, without a global schema; declarative specification De Giacomo & Rosati ESSLLI 05 Logic-based information integration 10

Database federation tools: characteristics Physical transparency (masking from the user the physical characteristics of the sources) Heterogeinity (federating highly diverse types of sources) Extensibility Autonomy of data sources Performance (distributed query optimization) However, current tools do not (directly) support logical or conceptual transparency De Giacomo & Rosati ESSLLI 05 Logic-based information integration 11

Logical transparency Basic ingredients for achieving logical transparency: The global schema (ontology) provides a conceptual view that is independent from the sources The global schema is described with a semantically rich formalism The mappings are the crucial tools for realizing the independence of the global schema from the sources Obviously, the formalism for specifying the mapping is also a crucial point All the above aspects are not appropriately dealt with by current tools. This means that information integration cannot be simply addressed on a tool basis. De Giacomo & Rosati ESSLLI 05 Logic-based information integration 12

Two variants of information integration (Mediator-based) information integration queries over the global schema this course Data exchange materialization of the global view see [Fagin&al. ICDT 03, Kolaitis PODS05,...] P2P data integration several peers each peer with local and external sources queries over one peer [Halevy&al. ICDE 03], Calvanese &al. PODS 04, Calvanese &al. DBPL 05,...] De Giacomo & Rosati ESSLLI 05 Logic-based information integration 13

Information integration Queries over the global schema Answer(Q) Query Global schema Sources De Giacomo & Rosati ESSLLI 05 Logic-based information integration 14

Data exchange Materialization of the global view Materialize Target Source De Giacomo & Rosati ESSLLI 05 Logic-based information integration 15

Peer-to-peer information integration P 1 P 2 P 3 P4 P 5 Operations: - Answer(Q, P i ) - Materialize(P i ) Peer Peer schema Local source External source Local mapping P2P mapping De Giacomo & Rosati ESSLLI 05 Logic-based information integration 16

Lecture 1: outline Introduction to information (data) integration Data integration: logical formalization De Giacomo & Rosati ESSLLI 05 Logic-based information integration 17

Information integration Answer(Q) Query Global schema Sources De Giacomo & Rosati ESSLLI 05 Logic-based information integration 18

Main problems in data integration 1. How to construct the global schema 2. (Automatic) source wrapping 3. How to discover mappings between the sources and the global schema 4. Limitations in the mechanisms for accessing the sources 5. Data extraction, cleaning and reconciliation 6. How to process updates expressed on the global schema, and updates expressed on the sources ( read/write vs read-only data integration) 7. The modeling problem: How to model the mappings between the sources and the global schema 8. The querying problem: How to answer queries expressed on the global schema 9. Query optimization De Giacomo & Rosati ESSLLI 05 Logic-based information integration 19

Formal framework for data integration A data integration system I is a triple G, S, M, where G is the global schema The global schema is a logical theory over an alphabet A G S is the source schema The source schema is constituted simply by an alphabet A S disjoint from A G M is the mapping between S and G Different approaches to the specification of mapping De Giacomo & Rosati ESSLLI 05 Logic-based information integration 20

Semantics of a data integration system Which are the databases that satisfy I, i.e., which are the logical models of I? The databases that satisfy I are logical interpretations for A G (called global databases). We refer only to databases over a fixed infinite domain Γ of constants. Let C be a source database over Γ (also called source model), fixing the extension of the predicates of A S (thus modeling the data present in the sources). The set of models of (i.e., databases for A G that satisfy) I relative to C is: sem C (I) = { B B is a G-model (i.e., a global database that is legal wrt G) and is an M-model wrt C (i.e., satisfies M wrt C) } What it means to satisfy M wrt C depends on the nature of the mapping M. De Giacomo & Rosati ESSLLI 05 Logic-based information integration 21

Semantics of queries to I A query q of arity n is a formula with n free variables. If D is a database, then q D denotes the extension of q in D (i.e., the set of n-tuples that are valuations in Γ for the free variables of q that make q true in D). If q is a query of arity n posed to a data integration system I (i.e., a formula over A G with n free variables), then the set of certain answers to q wrt I and C is cert(q, I, C) = {(c 1,..., c n ) q B B sem C (I)}. Note: query answering is logical implication. Note: complexity will be mainly measured wrt the size of the source database C, and will refer to the problem of deciding whether c cert(q, I, C), for a given c. De Giacomo & Rosati ESSLLI 05 Logic-based information integration 22

Databases with incomplete information, or Knowledge Bases Traditional database: one model of a first-order theory Query answering means evaluating a formula in the model Database with incomplete information, or Knowledge Base: set of models (specified, for example, as a restricted first-order theory) Query answering means computing the tuples that satisfy the query in all the models in the set There is a strong connection between query answering in data integration and query answering in databases with incomplete information under constraints (or, query answering in knowledge bases). De Giacomo & Rosati ESSLLI 05 Logic-based information integration 23

Query answering with incomplete information [Reiter 84]: relational setting, databases with incomplete information modeled as a first order theory [Vardi 86]: relational setting, complexity of reasoning in closed world databases with unknown values Several approaches both from the DB and the KR community [van der Meyden 98]: survey on logical approaches to incomplete information in databases De Giacomo & Rosati ESSLLI 05 Logic-based information integration 24

The mapping How is the mapping M between S and G specified? Are the sources defined in terms of the global schema? Approach called source-centric, or local-as-view, or LAV Is the global schema defined in terms of the sources? Approach called global-schema-centric, or global-as-view, or GAV A mixed approach? Approach called GLAV De Giacomo & Rosati ESSLLI 05 Logic-based information integration 25

GAV vs LAV example Global schema: movie(title, Year, Director) european(director) review(title, Critique) Source 1: r 1 (Title, Year, Director) since 1960, European directors Source 2: r 2 (T itle, Critique) since 1990 Query: Title and critique of movies in 1998 D. movie(t, 1998, D) review(t, R), written { (T, R) movie(t, 1998, D) review(t, R) } De Giacomo & Rosati ESSLLI 05 Logic-based information integration 26

Formalization of LAV In LAV (with sound sources), the mapping M is constituted by a set of assertions: s φ G one for each source element s in A S, where φ G is a query over G of the arity of s. Given source database C, a database B for G satisfies M wrt C if for each s S: s C B φ G In other words, the assertion means x (s( x) φ G ( x)). The mapping M and the source database C do not provide direct information about which data satisfy the global schema. Sources are views, and we have to answer queries on the basis of the available data in the views. De Giacomo & Rosati ESSLLI 05 Logic-based information integration 27

LAV example Global schema: movie(title, Year, Director) european(director) review(title, Critique) LAV: associated to source relations we have views over the global schema r 1 (T, Y, D) { (T, Y, D) movie(t, Y, D) european(d) Y 1960 } r 2 (T, R) { (T, R) movie(t, Y, D) review(t, R) Y 1990 } The query { (T, R) movie(t, 1998, D) review(t, R) } is processed by means of an inference mechanism that aims at re-expressing the atoms of the global schema in terms of atoms at the sources. In this case: { (T, R) r 2 (T, R) r 1 (T, 1998, D) } De Giacomo & Rosati ESSLLI 05 Logic-based information integration 28

Formalization of GAV In GAV (with sound sources), the mapping M is constituted by a set of assertions: g φ S one for each element g in A G, where φ S is a query over S of the arity of g. Given source database C, a database B for G satisfies M wrt C if for each g G: g B C φ S In other words, the assertion means x (φ S ( x) g( x)). Given a source database, M provides direct information about which data satisfy the elements of the global schema. Relations in G are views, and queries are expressed over the views. Thus, it seems that we can simply evaluate the query over the data satisfying the global relations (as if we had a single database at hand). De Giacomo & Rosati ESSLLI 05 Logic-based information integration 29

GAV example Global schema: movie(title, Year, Director) european(director) review(title, Critique) GAV: associated to relations in the global schema we have views over the sources movie(t, Y, D) { (T, Y, D) r 1 (T, Y, D) } european(d) { (D) r 1 (T, Y, D) } review(t, R) { (T, R) r 2 (T, R) } De Giacomo & Rosati ESSLLI 05 Logic-based information integration 30

GAV example (constraints) see more later Global schema containing constraints: movie(title, Year, Director) european(director) review(title, Critique) european movie 60s(Title, Year, Director) T, Y, D. european movie 60s(T, Y, D) movie(t, Y, D) D. T, Y. european movie 60s(T, Y, D) european(d). GAV mappings: european movie 60s(T, Y, D) { (T, Y, D) r 1 (T, Y, D) } european(d) { (D) r 1 (T, Y, D) } review(t, R) { (T, R) r 2 (T, R) } De Giacomo & Rosati ESSLLI 05 Logic-based information integration 31

The query GAV example of query processing { (T, R) movie(t, 1998, D) review(t, R) } is processed by means of unfolding, i.e., by expanding each atom according to its associated definition in M, so as to come up with source relations. In this case: movie(t,1998,d) review(t,r) unfolding r 1 (T,1998,D) r 2 (T,R) De Giacomo & Rosati ESSLLI 05 Logic-based information integration 32

GAV and LAV comparison LAV: (Information Manifold, DWQ) Quality depends on how well we have characterized the sources High modularity and extensibility (if the global schema is well designed, when a source changes, only its definition is affected) Query processing needs reasoning (query answering complex) GAV: (Carnot, SIMS, Tsimmis, IBIS, Momis, DisAtDis,... ) Quality depends on how well we have compiled the sources into the global schema through the mapping Whenever a source changes or a new one is added, the global schema needs to be reconsidered Query processing can be based on some sort of unfolding (query answering looks easier without constraints) De Giacomo & Rosati ESSLLI 05 Logic-based information integration 33

Beyond GAV and LAV: GLAV In GLAV (with sound sources), the mapping M is constituted by a set of assertions: φ S φ G where φ S is a query over S, and φ G is a query over G of the arity φ S. Given source database C, a database B that is legal wrt G satisfies M wrt C if for each assertion in M: φ S C φ G B In other words, the assertion means x (φ S ( x) φ G ( x)). As for LAV, the mapping M does not provide direct information about which data satisfy the global schema: to answer a query q over G, we have to infer how to use M in order to access the source database C. De Giacomo & Rosati ESSLLI 05 Logic-based information integration 34

Example of GLAV Global schema: W ork(p erson, P roject), Area(P roject, F ield) Source 1: Source 2: Source 3: HasJob(P erson, F ield) T each(p rof essor, Course), In(Course, F ield) Get(Researcher, Grant), F or(grant, P roject) GLAV mapping: { (r, f) HasJob(r, f) } { (r, f) W ork(r, p) Area(p, f) } { (r, f) T each(r, c) In(c, f) } { (r, f) W ork(r, p) Area(p, f) } { (r, p) Get(r, g) F or(g, p) } { (r, p) W ork(r, p) } De Giacomo & Rosati ESSLLI 05 Logic-based information integration 35

GLAV: a technical observation In GLAV (with sound sources), the mapping M is constituted by a set of assertions: φ S φ G Each such assertion can be rewritten wlog by introducing a new predicate r (not to be used in the queries) of the same arity as the two queries and replace the assertion with the following two: φ S r r φ G In other words, we replace x (φ S ( x) φ G ( x)) with x (φ S ( x) r( x)) and x (r( x) φ G ( x)). De Giacomo & Rosati ESSLLI 05 Logic-based information integration 36