The ongoing transformation of biology from a data-poor science into an increasingly data-rich one, with the attendant increase in the number, size, and diversity of data sources (e.g., protein sequences, structures, expression patterns, interactions), offers unprecedented and as yet largely unrealized opportunities for large-scale collaborative discovery in a number of areas, including characterization of macromolecular sequence-structure-function relationships and discovery of complex genetic regulatory networks.
Given the large number, autonomous nature and the size of the relevant data sources, gathering all of the data in a centralized location is generally neither desirable nor feasible. Hence, there is a need for methods to perform the necessary analysis of data where the data and the computational resources are available and transmit the results of analysis (knowledge acquired from the data) to where they are needed. More importantly, data sources developed by autonomous individuals or groups differ with respect to their ontological commitments (that is, assumptions concerning the objects that exist in the world, the properties or attributes of the objects, the possible values of attributes, and their intended meaning). Therefore, semantic differences among autonomous data sources are simply unavoidable. Because data sources that are created for use in one context often find use in other contexts or applications and because users often need to analyze data in different contexts from different perspectives, there is no single privileged ontology that can serve all users, or for that matter, even a single user, in every context. Effective use of multiple sources of data in a given context requires flexible approaches to reconciling such semantic differences from the user’s point of view.
To address the information integration and knowledge acquisition needs of collaborative scientific discovery, we have designed INDUS (INtelligent Data Understanding System), a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous data (see Figure). INDUS employs ontologies and inter-ontology mappings to enable a user or an application to view a collection of physically distributed, autonomous, semantically heterogeneous data sources (regardless of their location, internal structure, and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology.
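The idea of viewing heterogeneous sources through a user-supplied ontology can be illustrated with a minimal sketch. This is not the INDUS implementation: the `Mapping` class, the `query` function, the source names, and all attribute and value names are hypothetical, and inter-ontology mappings are reduced here to simple renaming tables for attributes and values.

```python
from typing import Dict, List

Row = Dict[str, str]


class Mapping:
    """Hypothetical inter-ontology mapping from one source to the user ontology."""

    def __init__(self, attr_map: Dict[str, str],
                 value_map: Dict[str, Dict[str, str]]):
        self.attr_map = attr_map    # source attribute -> user attribute
        self.value_map = value_map  # user attribute -> {source value -> user value}

    def to_user_view(self, row: Row) -> Row:
        """Re-express one source row in the user's vocabulary."""
        out: Row = {}
        for src_attr, value in row.items():
            user_attr = self.attr_map.get(src_attr)
            if user_attr is None:
                continue  # no counterpart in the user ontology
            out[user_attr] = self.value_map.get(user_attr, {}).get(value, value)
        return out


def query(sources: Dict[str, List[Row]], mappings: Dict[str, Mapping],
          attr: str, value: str) -> List[Row]:
    """Answer 'attr == value', posed in user terms, across all sources."""
    results: List[Row] = []
    for name, rows in sources.items():
        for row in rows:
            view = mappings[name].to_user_view(row)
            if view.get(attr) == value:
                results.append(view)
    return results


# Two made-up sources that describe proteins with different vocabularies.
sources = {
    "lab_a": [{"prot_id": "P1", "fn": "kinase"}, {"prot_id": "P2", "fn": "TF"}],
    "lab_b": [{"accession": "P3", "role": "phosphotransferase"}],
}
mappings = {
    "lab_a": Mapping({"prot_id": "protein", "fn": "function"},
                     {"function": {"kinase": "enzyme", "TF": "regulator"}}),
    "lab_b": Mapping({"accession": "protein", "role": "function"},
                     {"function": {"phosphotransferase": "enzyme"}}),
}
enzymes = query(sources, mappings, "function", "enzyme")
```

In this sketch a different user could supply different mappings over the same sources and obtain a different view, without any change to the sources themselves.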
INDUS and the associated collection of software tools:
(a) Support editing of ontologies and specification of semantic relationships between ontologies (using inter-ontology mappings [Bao and Honavar, 2004]) by users with some familiarity with the data sources, using a graphical user interface.
(b) Enable users to query distributed, semantically heterogeneous data and retrieve and manipulate results in a fashion that respects the user-imposed semantic relationships between different sources of data [Caragea et al., 2004b].
(c) Support construction of predictive classifiers from semantically heterogeneous distributed data sources without having to assemble all of the data at a central location [Caragea et al., 2004a; Caragea et al., 2004b]. This is achieved by decomposing the task of learning from data into an information extraction task, which formulates and sends a statistical query to a data source, and a hypothesis generation task, which uses the resulting statistic to modify a partially constructed hypothesis (and further invokes the information extraction component as needed).
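The decomposition into information extraction and hypothesis generation can be sketched with a Naive Bayes classifier, for which counts are sufficient statistics. The code below is an illustration under stated assumptions, not the INDUS code: `DataSource`, `count_query`, `learn_naive_bayes`, and `predict` are hypothetical names, the data are made up, every source is assumed to already share one vocabulary, and the smoothing assumes binary-valued attributes.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], str]  # (attribute values, class label)


class DataSource:
    """Hypothetical autonomous source; it returns statistics, never raw rows."""

    def __init__(self, examples: List[Example]):
        self._examples = examples

    def count_query(self) -> Tuple[Counter, Counter]:
        """Information extraction: answer a statistical (count) query locally."""
        class_counts, feature_counts = Counter(), Counter()
        for attrs, label in self._examples:
            class_counts[label] += 1
            for attr, value in attrs.items():
                feature_counts[(attr, value, label)] += 1
        return class_counts, feature_counts


def learn_naive_bayes(sources: List[DataSource]) -> Tuple[Counter, Counter]:
    """Hypothesis generation: combine per-source counts into one model."""
    class_counts, feature_counts = Counter(), Counter()
    for source in sources:
        c, f = source.count_query()
        class_counts.update(c)    # Counter.update adds counts
        feature_counts.update(f)
    return class_counts, feature_counts


def predict(model: Tuple[Counter, Counter], attrs: Dict[str, str]) -> str:
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = "", float("-inf")
    for label, n in class_counts.items():
        score = math.log(n / total)
        for attr, value in attrs.items():
            # Add-one smoothing; the "+ 2" assumes two values per attribute.
            count = feature_counts[(attr, value, label)]
            score += math.log((count + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up examples held at two separate sites.
site1_data = [({"domain": "yes"}, "enzyme"), ({"domain": "no"}, "other")]
site2_data = [({"domain": "yes"}, "enzyme"), ({"domain": "no"}, "other"),
              ({"domain": "yes"}, "other")]
model = learn_naive_bayes([DataSource(site1_data), DataSource(site2_data)])
```

Because the counts simply add across sources, the model in this sketch is identical to the one that would be obtained from the pooled data. Naive Bayes needs only a single round of queries; learners such as decision trees would issue further count queries as the hypothesis is refined.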