ARTIFICIAL INTELLIGENCE RESEARCH LABORATORY
    Center for Computational Intelligence, Learning, and Discovery
    Department of Computer Science


Algorithms and Software for Knowledge Acquisition from Heterogeneous, Distributed, Autonomous Information Sources
The Intelligent Data Understanding System (INDUS) Project


Personnel


Project Summary

The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. However, the massive size, heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles in extracting knowledge from these data. This research seeks to overcome these hurdles through the design, analysis, and implementation of:

The resulting algorithms and software can accelerate, potentially by an order of magnitude, the rate of scientific discovery in emerging data-rich domains such as the biological sciences. This research is closely integrated with the education and training of graduate and undergraduate students in Computer Science and in Bioinformatics and Computational Biology at Iowa State University.


Funding

At present, the primary source of funding for this project is:

This project has benefited from funding for related, but not overlapping, work from other sources including:

In the past, some of the work leading up to this project was supported in part by:


Publications

  1. Bao, J., Hu, Z., Caragea, D., Reecy, J., and Honavar, V. A Tool for Collaborative Construction of Large Biological Ontologies. Fourth International Workshop on Biological Data Management (BIDM 2006), Krakow, Poland, IEEE Press. In press, 2006.

  2. Bao, J., Caragea, D., and Honavar, V. Towards Collaborative Environments for Ontology Construction and Sharing. Proceedings of the International Symposium on Collaborative Technologies and Systems, Las Vegas, 2006.

  3. Bao, J., Caragea, D., and Honavar, V. A Distributed Tableau Algorithm for Package-based Description Logics. Proceedings of the Second International Workshop on Context Representation and Reasoning (CRR 2006), Riva del Garda, Italy, CEUR. In press, 2006.

  4. Bao, J., Caragea, D., and Honavar, V. Modular Ontologies - A Formal Investigation of Semantics and Expressivity. In Proceedings of the First Asian Semantic Web Conference, Beijing, China, Springer-Verlag. In press, 2006.

  5. Kang, D-K., Silvescu, A., and Honavar, V. RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science, Berlin: Springer-Verlag. pp. 45-54, 2006.

  6. Pathak, J., Basu, S., and Honavar, V. Modeling Web Service Composition Using Symbolic Transition Systems. AAAI '06 Workshop on AI-Driven Technologies for Services-Oriented Computing (AI-SOC), Boston, MA, AAAI Press, Accepted, 2006.

  7. Pathak, J., Basu, S., Lutz, R., and Honavar, V. MoSCoE: A Framework for Modeling Web Service Composition and Execution. IEEE Conference on Data Engineering Ph.D. Workshop, Atlanta, GA, 2006.

  8. Pathak, J., Yong, J., Honavar, V., and McCalley, J. Condition Data Aggregation for Failure Mode Estimation of Power Transformers. Hawaii International Conference on Systems Sciences, IEEE Computer Society. pp. 241a, 2006.

  9. Vasile, F., Silvescu, A., Kang, D-K., and Honavar, V. TRIPPER: An Attribute Value Taxonomy Guided Rule Learner. Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Berlin: Springer-Verlag. pp. 55-59, 2006.

  10. Zhang, J., Kang, D-K., Silvescu, A. and Honavar, V. Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. Knowledge and Information Systems. Vol. 9. No. 2. pp. 157-179, 2006.

  11. Caragea, D., Zhang, J., Bao, J., Pathak, J., and Honavar, V. (2005). Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous Information Sources (Invited paper). In: Proceedings of the 16th International Conference on Algorithmic Learning Theory. Lecture Notes in Computer Science. Singapore. Vol. 3734. pp. 13-44. Berlin: Springer-Verlag.

  12. Caragea, D., Silvescu, A., Pathak, J., Bao, J., Andorf, C., Dobbs, D., and Honavar, V. (2005). Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. In: Data Integration in Life Sciences (DILS 2005) Springer-Verlag Lecture Notes in Computer Science. San Diego. Vol. 3615. pp. 175-190. Berlin: Springer-Verlag.

  13. Caragea, D., Bao, J., Pathak, J., Andorf, C., Dobbs, D., and Honavar, V. Information Integration from Semantically Heterogeneous Biological Data Sources. Proceedings of the Sixteenth International Workshop on Databases and Expert Systems Applications (DEXA 05), Copenhagen, IEEE Computer Society. pp. 580-584, 2005.

  14. Kang, D-K., Zhang, J., Silvescu, A., and Honavar, V. Multinomial Event Model Based Abstraction for Sequence and Text Classification. Proceedings of the Symposium on Abstraction, Reformulation, and Approximation (SARA 2005), Edinburgh, UK, Berlin: Springer-Verlag. Vol. 3607. pp. 134-148, 2005.

  15. Kang, D-K., Fuller, D., and Honavar, V. Learning Misuse and Anomaly Detectors from System Call Frequency Vector Representation. IEEE International Conference on Intelligence and Security Informatics. Springer-Verlag Lecture Notes in Computer Science, Springer-Verlag. Vol. 3495. pp. 511-516, 2005.

  16. Kang, D-K., Fuller, D., and Honavar, V. Learning Classifiers for Misuse and Anomaly Detection Using a Bag of System Calls Representation. Proceedings of the 6th IEEE Systems, Man, and Cybernetics Workshop (IAW 05), West Point, NY, IEEE. pp. 118-125, 2005.

  17. Pathak, J., Koul, N., Caragea, D., and Honavar, V. (2005). A Framework for Semantic Web Services Discovery. In: Proceedings of the 7th ACM International Workshop on Web Information and Data Management (WIDM 2005). pp. 45-50. ACM Press.

  18. Yakhnenko, O., Silvescu, A., and Honavar, V. Discriminatively Trained Markov Model for Sequence Classification. IEEE Conference on Data Mining (ICDM 2005), Houston, Texas, IEEE Press, 2005.

  19. Zhang, J., Caragea, D. and Honavar, V. (2005). Learning Ontology-Aware Classifiers. In: Proceedings of the 8th International Conference on Discovery Science. Springer-Verlag Lecture Notes in Computer Science. Singapore. Vol. 3735. pp. 308-321. Berlin: Springer-Verlag.

  20. Caragea, D., Pathak, J., and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE 2004), Agia Napa, Cyprus, 2004.

  21. Bao, J., Cao, Y., Tavanapong, W., and Honavar, V. (2004). Integration of Domain-Specific and Domain-Independent Ontologies for Colonoscopy Video Database Annotation. In: International Conference on Information and Knowledge Engineering (IKE 04). CSREA Press. pp. 82-88.

  22. Bao, J. and Honavar, V. (2004). Collaborative Ontology Building With Wiki@nt. In: Third International Workshop on Evaluation of Ontology Building Tools. Hiroshima.

  23. Caragea, D., Pathak, J. and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE 2004). Springer-Verlag Lecture Notes in Computer Science. Agia Napa, Cyprus. Vol. 3291. pp. 963-980. Springer-Verlag.

  24. Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1. pp. 80-89.

  25. Kang, D-K., Silvescu, A., Zhang, J., and Honavar, V. (2004). Generation of Attribute Value Taxonomies from Data for Data-Driven Construction of Accurate and Compact Classifiers. In: Proceedings of the IEEE International Conference on Data Mining.

  26. Pathak, J., Caragea, D., and Honavar, V. (2004). Ontology-Extended Component-Based Workflows: A Framework for Constructing Complex Workflows from Semantically Heterogeneous Software Components. In: Proceedings of the Workshop on Semantic Web and Databases (SWDB-04). Springer-Verlag Lecture Notes in Computer Science. In press.

  27. Yan, C., Dobbs, D., and Honavar, V. (2004). A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. In: Bioinformatics. Vol. 20. pp. i371-378.

  28. Yan, C., Honavar, V. and Dobbs, D. (2004). Identifying Protein-Protein Interaction Sites from Surface Residues - A Support Vector Machine Approach. Neural Computing and Applications. Vol. 13. pp. 123-129.

  29. Zhang, J. and Honavar, V. (2004). AVT-NBL - An Algorithm for Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. In: Proceedings of the IEEE International Conference on Data Mining.

  30. Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag.

  31. Caragea, D., Reinoso-Castillo, J., Silvescu, A. (2003). Statistics Gathering for Information Integration on the Web. In: Proceedings of the IJCAI-03 Workshop on Information Integration on the Web.

  32. Reinoso-Castillo, J., Silvescu, A., Caragea, D., Pathak, J. and Honavar, V. (2003). Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sources: A Federated, Query-Centric Approach. IEEE International Conference on Information Integration and Reuse.

  33. Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. In press.

  34. Reinoso-Castillo, J. (2002). Ontology-Driven Information Extraction and Integration from Autonomous, Heterogeneous, Distributed Data Sources -- A Federated Query-Centric Approach. Master's Thesis. Artificial Intelligence Research Laboratory. Department of Computer Science. Iowa State University.

  35. Zhang, J., Silvescu, A., and Honavar, V. (2002). Ontology-Driven Induction of Decision Trees at Multiple Levels of Abstraction. In: Proceedings of Symposium on Abstraction, Reformulation, and Approximation. Berlin: Springer-Verlag.

  36. Caragea, D., Silvescu, A., and Honavar, V. (2001). Invited Chapter. Towards a Theoretical Framework for Analysis and Synthesis of Agents That Learn from Distributed Dynamic Data Sources. In: Emerging Neural Architectures Based on Neuroscience. Berlin: Springer-Verlag.

  37. Polikar, R., Udpa, L., Udpa, S., and Honavar, V. (2001). Learn++: An Incremental Learning Algorithm for Multi-Layer Perceptron Networks. IEEE Transactions on Systems, Man, and Cybernetics. Vol. 31, No. 4. pp. 497-508.

  38. Caragea, D., Silvescu, A., and Honavar, V. (2000). Agents That Learn from Distributed Dynamic Data Sources. In: Proceedings of the ECML 2000/Agents 2000 Workshop on Learning Agents. Barcelona, Spain.

  39. Honavar, V., Miller, L. and Wong, J. (1998). Distributed Knowledge Networks. In: Proceedings of the IEEE Information Technology Conference. Syracuse, NY.


Project Impact


Contributions within Discipline

The project has contributed to the development of provably sound ontology-based approaches to data integration that allow scientists to view and combine a given set of data sources from multiple ontological points of view, based on ontologies of their own choosing. The framework supports efficient extraction of the sufficient statistics (e.g., counts that satisfy certain constraints on attribute values) needed to construct classifiers under a broad range of assumptions about the capabilities offered by the information sources (execution of aggregate operators, execution of code supplied by the user). This work has also resulted in novel algorithms for exploiting particularly common types of ontologies -- class-subclass hierarchies and attribute-value taxonomies -- in learning compact, accurate, and comprehensible classifiers from semantically heterogeneous distributed data. These results collectively represent important contributions towards the realization of the Semantic Web for e-Science.
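
As a rough illustration of the sufficient-statistics idea described above, the following Python sketch learns a Naive Bayes classifier from several data sources that expose only counts, never raw records. It is not the INDUS implementation: the class DataSource, the functions learn_naive_bayes and predict, and the toy "motif" attribute are all hypothetical names chosen for this example, and the sources are assumed to already share one attribute vocabulary (no ontology mapping is shown).

```python
from collections import Counter
from math import log


class DataSource:
    """Hypothetical autonomous source that answers only count queries."""

    def __init__(self, rows):
        self._rows = rows  # list of (attribute dict, class label) pairs

    def counts(self):
        """Return class counts and (attribute, value, class) counts."""
        class_counts, joint_counts = Counter(), Counter()
        for attrs, label in self._rows:
            class_counts[label] += 1
            for name, value in attrs.items():
                joint_counts[(name, value, label)] += 1
        return class_counts, joint_counts


def learn_naive_bayes(sources):
    """Pool the counts from every source; no raw rows leave a source."""
    class_counts, joint_counts = Counter(), Counter()
    for source in sources:
        c, j = source.counts()
        class_counts.update(c)  # Counter.update adds counts
        joint_counts.update(j)
    return class_counts, joint_counts


def predict(model, attrs, values_per_attr):
    """Pick the most probable class, using Laplace-smoothed estimates."""
    class_counts, joint_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = log(n / total)
        for name, value in attrs.items():
            k = values_per_attr[name]  # number of distinct values of the attribute
            score += log((joint_counts[(name, value, label)] + 1) / (n + k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy example: two sites holding disjoint subsets of the same kind of data.
site_a = DataSource([({"motif": "yes"}, "binding"), ({"motif": "yes"}, "binding")])
site_b = DataSource([({"motif": "no"}, "other"), ({"motif": "yes"}, "binding"),
                     ({"motif": "no"}, "other")])
model = learn_naive_bayes([site_a, site_b])
label = predict(model, {"motif": "yes"}, {"motif": 2})
```

Because the pooled counts equal the counts that would be obtained from the union of the data, the classifier learned this way is identical to one trained on a centralized copy, which is the sense in which the counts are sufficient statistics for this learner.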


Contributions to Other Disciplines

This research has resulted in applications of data mining to two representative problems in computational molecular biology -- sequence-based prediction of protein function and identification of protein-protein interaction sites.


Contributions to Education and Human Resources

The project has, with the help of funds leveraged from other sources, contributed to the training of several Ph.D. students in Computer Science (Doina Caragea, Adrian Silvescu, Jun Zhang, Jie Bao, Carson Andorf, Oksana Yakhnenko, Jyotishman Pathak, Dae-Ki Kang) and in Bioinformatics and Computational Biology (Changhui Yan and Mgavi Braithwaite), and of two M.S. students (Jaime Reinoso-Castillo and Anna Atramentov). The project has also provided research opportunities for two undergraduate students. This research has led to the establishment of a Center for Computational Intelligence, Learning, and Discovery focused on large-scale data-driven e-Science at Iowa State University. This research has also strengthened interdisciplinary research collaborations between Vasant Honavar (a computer scientist), Drena Dobbs (a molecular biologist), Robert Jernigan (a biophysicist), and Heather Greenlee (a neuroscientist).


Integration of Research into Graduate and Undergraduate Curriculum

Honavar has developed and taught a module on machine learning approaches to bioinformatics, based in part on the results of research supported by this award. The module was taught this summer at Iowa State University, at an NSF-NIH supported Summer Institute on Bioinformatics and Computational Biology for undergraduates and beginning graduate students from around the US. Some of the research problems and results have also been integrated into a course on machine learning.


Current Research Directions


Software


Talks and Posters