
Invited Talk: "Scientific Data Integration: From the Big Picture to some Gory Details"
by Dr. Bertram Ludaescher
List of Accepted Papers
Tentative Workshop Schedule
Invited Talk Slides
Online Workshop Proceedings
Background:
Recent advances in high-performance computing, high-speed and
high-bandwidth communication, massive storage, and software (e.g., web
services) that can be remotely invoked on the Internet present
unprecedented opportunities for data-driven knowledge acquisition in
virtually all areas of human endeavor, including collaborative
cross-disciplinary discovery in e-science, bioinformatics, e-government,
environmental informatics, health informatics, security informatics,
e-business, education, and social informatics, among others. Given the
explosive growth in the number and diversity of potentially useful
information sources in many domains, there is an urgent need for sound
approaches to integrative and collaborative analysis and interpretation
of distributed, autonomous (and hence, inevitably semantically
heterogeneous) data sources.
Machine learning offers some of the most cost-effective approaches to
automated or semi-automated knowledge acquisition (discovery of
features, correlations, and other complex relationships and hypotheses
that describe potentially interesting regularities from large data
sets) in many data-rich application domains. However, applying current
approaches to machine learning in emerging data-rich application
domains presents several challenges in practice:
- Centralized access to data
(assumed by most machine learning algorithms) is infeasible because of
the large size and/or access restrictions imposed by the autonomous
data sources. Hence, there is a need for knowledge acquisition systems
that can perform the necessary analysis of data at the locations where
the data and the computational resources are available and transmit the
results of analysis (knowledge acquired from the data) to the locations
where they are needed.
- Ontological commitments
associated with a data source (that is, assumptions concerning the
objects that exist in the world, the properties or attributes of the
objects, the possible values of attributes, and their intended meaning)
are determined by the intended use of the data repository (at design
time). In addition, data sources that are created for use in one
context often find use in other contexts or applications. Therefore,
semantic differences among autonomous data sources are simply
unavoidable. Because users often need to analyze data in different
contexts from different perspectives, there is no single privileged
ontology that can serve all users, or for that matter, even a single
user, in every context. Effective use of multiple sources of data in a
given context requires reconciliation of such semantic differences from
the user’s point of view.
- Explicitly associating
ontologies with data repositories results in partially specified data,
i.e., data that are described in terms of attribute values at different
levels of abstraction. For example, the program of study of one student in a
data source can be specified as Graduate Program (higher level of
abstraction), while the program of study of a different student in the
same data source (or even a different data source) can be specified as
Doctoral Program (lower level of abstraction).
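The first challenge above (analyzing data where it resides and transmitting only the results) can be illustrated with a minimal sketch. It assumes horizontally fragmented data, i.e., every source holds complete rows over the same attributes, and it assumes the learner needs only frequency counts, as a naive Bayes classifier does. The function names, attribute names, and sample rows are hypothetical and are not taken from any workshop paper.

```python
from collections import Counter

def local_counts(rows, class_attr):
    """Run at a data source: summarize the local rows as counts only."""
    counts = Counter()
    for row in rows:
        label = row[class_attr]
        counts[("class", label)] += 1
        for attr, value in row.items():
            if attr != class_attr:
                counts[(attr, value, label)] += 1
    return counts

def merge_counts(per_site_counts):
    """Run where the knowledge is needed: add up the transmitted summaries."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total

# Hypothetical rows held by two autonomous sources (never pooled).
site_a = [
    {"program": "Doctoral Program", "funded": "yes"},
    {"program": "Masters Program", "funded": "no"},
]
site_b = [
    {"program": "Doctoral Program", "funded": "yes"},
]

merged = merge_counts([
    local_counts(site_a, "funded"),
    local_counts(site_b, "funded"),
])
```

Under these assumptions the merged counts are identical to the counts that would be computed on the pooled data, so only the summaries need to cross site boundaries. Vertically fragmented or semantically heterogeneous sources need more than this sketch provides.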
Topics of Interest:
The workshop seeks to bring together researchers in
relevant areas of artificial intelligence (machine learning, data
mining, knowledge representation, ontologies), information
systems (information integration, databases, semantic Web), distributed
computing (service-oriented computing) and selected application areas
(e.g., bioinformatics, security informatics, environmental informatics)
to address several questions such as:
- What are some of
the research challenges presented by emerging data-rich application
domains such as bioinformatics, health informatics, security
informatics, social informatics, environmental informatics?
- How can we
perform knowledge discovery from distributed data (assuming different
types of data fragmentation, e.g., horizontal or vertical data
fragmentation; different hypothesis classes, e.g., naïve Bayes,
decision tree, support vector machine classifiers; different
performance criteria, e.g., accuracy versus complexity versus
reliability of the model generated, etc.)?
- How can we make
semantically heterogeneous data sources self-describing (e.g., by
explicitly associating ontologies with data sources and mappings
between them) in order to help collaborative scientific discovery from
autonomous information sources?
- How can we
represent, manipulate, and reason with ontologies and mappings between
ontologies?
- How can we learn
ontologies from data (e.g., attribute value taxonomies)?
- How can we learn
mappings between semantically heterogeneous data source schemas and
between their associated ontologies?
- How can we
perform knowledge discovery in the presence of ontologies (e.g.,
attribute value taxonomies) and partially specified data (data that are
described at different levels of abstraction within an ontology)?
- How can we
achieve online query relaxation when an initial query posed to the data
sources fails (i.e., returns no tuples)? That is, how do we perform
query-driven mining of the individual sources that yields
knowledge that can be used for query relaxation?
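As one illustration of the last question, the sketch below relaxes a failed query by generalizing a constant along an attribute value taxonomy. The taxonomy here is written by hand and is purely hypothetical (it extends the Graduate Program / Doctoral Program example from the Background section); in the setting the question describes, such knowledge would be mined from the sources rather than supplied. All names are assumptions made for the example.

```python
# Hypothetical attribute value taxonomy: child value -> parent concept.
PARENT = {
    "Doctoral Program": "Graduate Program",
    "Masters Program": "Graduate Program",
    "Graduate Program": "Any Program",
    "Undergraduate Program": "Any Program",
}

def is_a(value, concept):
    """True if value equals concept or lies below it in the taxonomy."""
    while value is not None:
        if value == concept:
            return True
        value = PARENT.get(value)
    return False

def relaxed_query(rows, attr, concept):
    """Generalize concept one level at a time until some row matches.

    Returns (concept_used, matching_rows), or (None, []) if even the
    root concept matches nothing.
    """
    while concept is not None:
        matches = [row for row in rows if is_a(row[attr], concept)]
        if matches:
            return concept, matches
        concept = PARENT.get(concept)
    return None, []

# Hypothetical source contents: no student is recorded as doctoral.
students = [
    {"name": "s1", "program": "Masters Program"},
    {"name": "s2", "program": "Undergraduate Program"},
]

used, found = relaxed_query(students, "program", "Doctoral Program")
```

The original query for Doctoral Program returns no tuples, so the sketch retries with the parent concept Graduate Program, which matches s1. It does not handle partially specified rows (for example, a row recorded only as Graduate Program that may in fact be doctoral); that case is part of what the questions above leave open.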
Submission Requirements:
PostScript or PDF versions of papers, no more than 10 pages long
(including figures, tables, and references) in the ICDM camera-ready
format (IEEE 2-column format), should be submitted electronically to
kadash-icdm05@cs.iastate.edu by October 9, 2005. Each paper will be
rigorously refereed by at least 2 reviewers for technical soundness,
originality, and clarity of presentation. Accepted papers will be
included in informal workshop proceedings published by ICDM and
distributed at the workshop.
Important Dates:
| Paper submission | October 9, 2005 |
| Notification of acceptance | October 16, 2005 |
| Camera ready papers | October 22, 2005 |
| Workshop date | November 27, 2005 |
Organizers:
- Naoki Abe - IBM
- Liviu Badea - ICI, Romania
- Doina Caragea - Iowa State University
- Marie desJardins - University of Maryland, Baltimore County
- C. Lee Giles - Pennsylvania State University
- Vasant Honavar - Iowa State University
- Hillol Kargupta - University of Maryland, Baltimore County
- Sally McClean - University of Ulster at Coleraine
- Bamshad Mobasher - DePaul University
- Ion Muslea - Language Weaver, Inc.
- C. David Page Jr. - University of Wisconsin, Madison
- Alexandrin Popescul - Ask Jeeves, Inc.
- Raghu Ramakrishnan - University of Wisconsin-Madison
- Steffen Staab - University of Koblenz