Iowa State University

 

Center for Computational Intelligence, Learning, & Discovery

 

 

 

Algorithms & Software for Knowledge Acquisition from Heterogeneous Distributed Data

 

Personnel

Dr. Vasant Honavar, Professor of Computer Science and of Bioinformatics and Computational Biology, Principal Investigator.


Dr. Drena Dobbs, Associate Professor of Molecular, Cell, and Developmental Biology, Co-Principal Investigator.


Dr. Doina Caragea, Research Associate, Computer Science. Focus: algorithms for learning classifiers from heterogeneous data; efficient extraction of sufficient statistics from heterogeneous data; a theoretical framework for knowledge acquisition from heterogeneous, distributed, autonomous data.


Summary

The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. However, the massive size, heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles to extracting knowledge from this data. Honavar's research on this topic, supported in part by an Information Technology Research (ITR) grant from the National Science Foundation (0219699) and a graduate fellowship from IBM, seeks to overcome these hurdles through the design, analysis, and implementation of:

  • Efficient distributed and cumulative learning algorithms with provable performance guarantees (relative to their centralized or batch counterparts) for knowledge acquisition from distributed data sources.
  • Customizable information extraction agents that exploit domain- or context-specific ontologies supplied by users to extract the information needed for learning (e.g., sufficient statistics) from distributed data sources, despite differences in query capabilities, interfaces, ontologies, and access restrictions, so that heterogeneous distributed data can be analyzed from different perspectives.
  • INDUS - a test-bed for knowledge acquisition from heterogeneous distributed data in computational molecular biology (e.g., characterization of protein sequence-structure-function relationships using diverse sources of biological data).
The resulting algorithms are being applied to representative data-driven knowledge discovery problems drawn from computational molecular biology.
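
As a rough illustration of the sufficient-statistics idea in the items above, here is a minimal Python sketch. It is not the project's INDUS code. It assumes a naive Bayes learner whose sufficient statistics are simple counts, two made-up data sources, and a hypothetical per-source mapping from each source's vocabulary into the user's terms. All names and data are invented for illustration.

```python
from collections import Counter

def local_counts(rows, mapping):
    """Compute count-based sufficient statistics at one data source.

    rows: (attribute_value, class_label) pairs in the source's own vocabulary.
    mapping: hypothetical dict translating source terms into the user's terms.
    """
    joint, classes = Counter(), Counter()
    for value, label in rows:
        joint[(mapping.get(value, value), label)] += 1
        classes[label] += 1
    return joint, classes

def merge(stats):
    """Add up the count tables returned by the individual sources."""
    joint, classes = Counter(), Counter()
    for source_joint, source_classes in stats:
        joint.update(source_joint)
        classes.update(source_classes)
    return joint, classes

def conditional(joint, classes, value, label):
    """Estimate P(value | label) from counts (no smoothing, for brevity)."""
    return joint[(value, label)] / classes[label]

# Toy data (assumed): source B names the same attribute values differently.
SITE_A = [("helix", "enzyme"), ("sheet", "enzyme"), ("helix", "other")]
SITE_B = [("H", "enzyme"), ("E", "other"), ("E", "other")]
MAP_A = {}                            # source A already uses the user's terms
MAP_B = {"H": "helix", "E": "sheet"}  # hypothetical user-supplied mapping

# Distributed setting: each source ships only its counts, never its rows.
distributed = merge([local_counts(SITE_A, MAP_A), local_counts(SITE_B, MAP_B)])

# Centralized baseline: translate and pool all rows, then count once.
pooled = [(MAP_B.get(value, value), label) for value, label in SITE_A + SITE_B]
centralized = local_counts(pooled, {})

print(distributed == centralized)
print(conditional(*distributed, "helix", "enzyme"))
```

Because counts add across sources, only the count tables need to leave each site, and in this simple case the merged statistics match those computed on pooled data. Real sources would also differ in query capabilities and access restrictions, which this sketch ignores.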

 


Funding

At present, the primary source of funding for this project is:
This project has benefited from funding for related, but not overlapping, work from other sources, including:
  • Discovering Protein Sequence-Structure-Function Relationships, Biological Information Science and Technology Initiative, National Institutes of Health (2003-2007). Vasant Honavar (PI) (with Drena Dobbs and Robert Jernigan). $1,022,000.
  • Pioneer Hi-Bred Graduate Fellowships in Bioinformatics and Computational Biology. Vasant Honavar (PI) (with doctoral students Adrian Silvescu and Carson Andorf). (2002-2004). $80,000.
  • IBM Doctoral Research Fellowship. Vasant Honavar (with doctoral student Doina Caragea). (2003-2004). $25,000.
In the past, some of the work leading up to this project was supported in part by:

 

Representative Publications

  1. Andorf, C., Silvescu, A., Dobbs, D., and Honavar, V. (2004). Probabilistic Graphical Models for Protein Function Classification. To appear.

  2. Andorf, C., Dobbs, D., and Honavar, V. (2004). Discovering Protein Function Classification Rules from Reduced Alphabet Representations of Protein Sequences.

  3. Caragea, D., Pathak, J., and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE 2004), Agia Napa, Cyprus, 2004.

  4. Bao, J., and Honavar, V. (2004). Ontology Language Extensions to Support Localized Semantics, Modular Reasoning, and Collaborative Ontology Design and Reuse. To appear.

  5. Bao, J., Cao, Y., Tavanapong, W., and Honavar, V. (2004). Integration of Domain-Specific and Domain-Independent Ontologies for Colonoscopy Video Database Annotation. In: International Conference on Information and Knowledge Engineering (IKE 04). In press.

  6. Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1. pp. 80-89.

  7. Kang, D-K., Silvescu, A., Zhang, J., and Honavar, V. (2004). Generation of Attribute Value Taxonomies from Data for Data-Driven Construction of Accurate and Compact Classifiers. In: Proceedings of the IEEE International Conference on Data Mining.

  8. Pathak, J., Caragea, D., and Honavar, V. (2004). Ontology-Extended Component-Based Workflows: A Framework for Constructing Complex Workflows from Semantically Heterogeneous Software Components. In: Proceedings of the Workshop on Semantic Web and Databases (SWDB-04). Springer-Verlag Lecture Notes in Computer Science. In press.

  9. Yan, C., Dobbs, D., and Honavar, V. (2004). A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. Bioinformatics. In press.

  10. Yan, C., Honavar, V., and Dobbs, D. (2004). Identifying Protein-Protein Interaction Sites from Surface Residues - A Support Vector Machine Approach. Neural Computing Applications. In press.

  11. Zhang, J. and Honavar, V. (2004). AVT-NBL - An Algorithm for Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. In: Proceedings of the IEEE International Conference on Data Mining. In press.

  12. Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag.

  13. Caragea, D., Reinoso-Castillo, J., and Silvescu, A. (2003). Statistics Gathering for Information Integration on the Web. In: Proceedings of the IJCAI-03 Workshop on Information Integration on the Web.

  14. Reinoso-Castillo, J., Silvescu, A., Caragea, D., Pathak, J., and Honavar, V. (2003). Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sources: A Federated, Query-Centric Approach. IEEE International Conference on Information Integration and Reuse.

  15. Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. In press.

  16. Reinoso-Castillo, J. (2002). Ontology-Driven Information Extraction and Integration from Autonomous, Heterogeneous, Distributed Data Sources -- A Federated Query-Centric Approach. Masters Thesis. Artificial Intelligence Research Laboratory. Department of Computer Science. Iowa State University.

  17. Zhang, J., Silvescu, A., and Honavar, V. (2002). Ontology-Driven Induction of Decision Trees at Multiple Levels of Abstraction. In: Proceedings of Symposium on Abstraction, Reformulation, and Approximation. Berlin: Springer-Verlag.

 


 

 

 

 

Atanasoff Hall

CILD is housed in Atanasoff Hall on the northwest side of campus.

 

 

Center for Computational Intelligence, Learning, & Discovery
214 Atanasoff Hall
Ames, IA 50011-1041

Phone: (515)294-9074
Fax:    (515)294-0258