Author

Liangyou Chen

Advisor

Hodges, Julia E.

Committee Member

Jamil, Hasan M.

Committee Member

Hansen, Eric

Committee Member

Banicescu, Ioana

Committee Member

Diehl, Walter J.

Date of Degree

1-1-2004

Document Type

Dissertation - Open Access

Major

Computer Science

Degree Name

Doctor of Philosophy

Abstract

This dissertation provides an ad hoc integration methodology to manage and integrate heterogeneous online distributed databases on demand. The problem arises from an impending demand among scientific users to conveniently manage existing Web data, together with the complexity involved in constructing a functional data federation system using existing data integration technologies. We close this gap with a database management framework accompanied by novel Web data specification languages, wrapper generation technologies, and distributed query processing techniques. A major achievement of this dissertation is the establishment of a sound relational data model for Web data. Under this model, the Web becomes a synthetic extension of traditional database systems. Consequently, a novice user of our system can cheaply integrate a large number of distributed Web sources with in-house databases for daily scientific data analysis. The relational Web modeling leads to a practical ad hoc integration system - the Meteoroid system (a MEthodology for ad hoc inTEgration of Online distributed heteROgeneous Internet Data) - in the context of biological data interoperability. We identify a main difficulty for ad hoc integration as the lack of a fully automated wrapper generation and maintenance technique for general semi-structured data such as HTML, XML and plain text documents. We address this issue through a thorough study of the characteristics of online Web data and devise various automated wrapper techniques to facilitate robust data wrapping tasks. With these techniques, form-based Web data and table-based Web data can be treated like traditional relational databases, and a seamless interoperation environment for Web data and in-house databases is possible. Another difficulty impeding ad hoc integration lies in query processing for heterogeneous distributed sources, where data conflicts are common and on-demand mediation of distributed sources is desirable. The dynamicity and unpredictability of Web data further complicate the query processing task. We studied the limitations that the Web environment poses for integration query processing and developed innovative techniques to expedite the early appearance of available results. Finally, we demonstrate a prototype system for ad hoc integration of heterogeneous biological data. In the system, visual Web-based interfaces guide novice users through the integration of heterogeneous data. A declarative environment is supported for ad hoc querying and management of distributed data sources.

URI

https://hdl.handle.net/11668/20256
