ACM Fourteenth Conference on
Information and Knowledge Management (CIKM)
CIKM and Workshops 2005
Tutorials
Tutorial
Workshops

Accessing XML Content: From DB and IR Perspectives

Sihem Amer-Yahia
AT & T Labs Research
Florham Park, NJ 07932, USA

Mounia Lalmas
Department of Computer Science
Queen Mary University of London

For more information click the title of the tutorial

Clustering Large and High Dimensional Data

Jacob Kogan
and
Charles Nicholas
Department of Mathematics and Statistics
University of Maryland Baltimore County

For more information click the title of the tutorial

Matching Words and Pictures: Problems, Applications and Progress

Latifur Khan
 Department of Computer Science
 Erik Jonsson School of Engineering and Computer Science
 University of Texas at Dallas

For more information click the title of the tutorial

Support Vector Machines for Information Management

James G. Shanahan
Turn Inc.
San Mateo, CA 94404, U.S.A.

For more information click the title of the tutorial

Probabilistic and Causal Modeling with Graphical Models in Information and Knowledge Management

Ricardo Silva
CALD - Center for Automated Learning and Discovery
Carnegie Mellon University

For more information click the title of the tutorial

High-Performance Indexing and Query Evaluation for Information Retrieval

Justin Zobel
School of Computer Science and Information Technology
RMIT University, Melbourne, Australia

For more information click the title of the tutorial

Accessing XML Content: From DB and IR Perspectives

Authors:

Sihem Amer-Yahia
AT & T Labs Research
Florham Park, NJ 07932, USA
sihem@research.att.com
Tel: +1 973 360 8957
Fax: +1 973 360 8871

Mounia Lalmas
Department of Computer Science
Queen Mary University of London
London, England, E1 4NS
mounia@dcs.qmul.ac.uk
Tel: +44 (0)20 7882 5200
Fax: +44 (0)20 8980 6533

Goal of Tutorial:

Documents today contain a mixture of structured and unstructured content. One way to format this mixed content is according to the adopted W3C (World Wide Web Consortium) standard for information repositories and exchanges, the so-called eXtensible Markup Language (XML)-http://www.w3.org/XML/.

An ever growing number of XML content is being made available. For example, the IEEE INEX (the INitiative for the Evaluation of XML) collection contains IEEE papers in XML form, including structured information such as the names of authors and the date of publication, and also unstructured information such as the text content of the paper. Other examples are Shakespeare's plays in XML, DBLP in XML, SIGMOD Record in XML, the United States Library of Congress documents in XML, medical data in XML and the MERIMEE collection of building descriptions in France. Furthermore, application domains such as the Library Science have growing needs to seamlessly query over both structured and unstructured content. As a result, a lot of approaches have been developed in the past few years to store and query such mixed content.

The development of approaches to access XML content has generated a wealth of issues that are being addressed by both the database (DB) and information retrieval (IR) communities. The DB community has traditionally focused on developing query languages and efficient evaluation algorithms for highly structured content. In contrast, the IR community has focused on searching unstructured content, and has developed various techniques for ranking query results and evaluating their effectiveness. Recent trends in DB and IR research demonstrate a growing interest in adopting IR techniques in DBs and vice versa for accessing XML content.

In this tutorial, we propose to survey the main challenges that arise from accessing XML content from both the DB and IR perspectives. In particular, we will describe the W3C and INEX efforts in both communities and the challenges behind storing XML documents, designing query languages, appropriate scoring and ranking methods, implementation architectures and query evaluation algorithms and, summarize open research issues. We believe that this tutorial is necessary to drive the attention of DB and IR researchers and practitioners to participate together in solving these issues.

In the rest of the proposal, we provide an overview of the organization of this tutorial, the intended audience and the authors' biographies. A selected DB and IR bibliography is also included at the end of the proposal.

This tutorial, in this form, has not presented at any other conference before. However, Sihem Amer-Yahia gave a lecture on full-text search in XML at the 7th Extending DataBase Technology summer school on XML and databases (EDBT'05) in Sardinia, Italy in September 2004. She also organized a panel at the last Special Interest Group for the Management of Data (SIGMOD'05) conference in Baltimore USA in June 2005 on the integration of DB and IR technologies and is presenting a tutorial on XML Full-Text Search at the Very Large DataBases (VLDB'05) conference in Trondheim, Norway in August 2005. In addition, Mounia Lalmas will be giving a lecture on structure/XML retrieval at the 5th European Summer School in Information Retrieval (ESSIR'05) in Dublin in September 2005, where the IR approach to accessing XML content will be presented. She has also given two keynote speeches on IR approaches to XML retrieval and their evaluation, at the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05) in Utrecht, The Netherlands, in January 2005, and the Premiere COnference en Recherche d'Information et Applications (CORIA'04), in Toulouse, France in March 2004.

Content of Tutorial:

The tutorial is organized in 3 parts. The first part presents XML documents and queries, the W3C and INEX efforts and their scope, and describes the challenges behind storing and accessing XML content. The second part of the tutorial presents the DB (often referred to as the data-centric) and IR (often referred to as the document-centric) views of XML. The DB view includes storing XML documents in relational databases, query language such as XPath, XQuery and XQuery Full-Text as well as query evaluation issues. The IR view includes retrieval methods for ranking XML elements and assessing the effectiveness of retrieval results. The last part of the tutorial concludes with the main challenges ahead in both communities.

Part I: Introduction and Motivation

The first part will provide a brief introduction of what XML is and its goals. We will then describe the two efforts, one in DB research and one in IR research, as they can be viewed as having set the scene for research in XML content access, from motivation, challenge, development, implementation and validation. In the DB area, we will discuss the W3C effort in designing the XQuery 1.0 and XPath 2.0 languages that provide powerful primitives to navigate the structure of XML documents. Many database researchers and practitioners have influenced the design of these languages, which therefore impact on DB approaches to access XML content. On the other hand, in the IR area, INEX was set up at the beginning of 2002 with the aim to establish infrastructures, XML test-suites and appropriate measurements for evaluating the performance (efficiency and effectiveness) of XML information access approaches. INEX has proposed its own query language, NEXI, an enhanced subset of XPath that deal specifically with content-oriented aspects.

Part II: Accessing XML Content

The DB View:

The DB community has a data-centric view and focused on issues related to storing XML documents in relational databases. Applications, such as electronic markets, that produce and consume large volumes of data require efficient and reliable storage and retrieval of XML data. Many techniques for XML storage have been proposed, including flat files, relational database management systems, object-oriented database systems, LDAP directories, and native XML database systems. We will review the main solutions proposed for XML storage. We will then describe XQuery, a SQL-like W3C language for querying XML and XPath, its core navigation language. We will also describe XQuery Full-Text, an extension to XPath/XQuery to support XML full-text search, that is being developed within the W3C. Finally, we will focus on efficient evaluation techniques for XML that were proposed in the DB community.

The IR View:

The IR view is concerned with the content-oriented retrieval of XML documents, and aims at supporting more precise access to XML documents by retrieving XML document components (the so-called XML elements) instead of whole documents in response to users' queries. Therefore, in principle, XML elements of any granularity (for example a paragraph or the section enclosing it) are potential answers to a query, as long as they are relevant. This constitutes a major departure from traditional IR and introduces new challenges. A first challenge is that XML retrieval systems need not only find relevant information in the XML documents, but also determine the appropriate level of component granularity to return to the user. A second challenge is that the retrieved XML elements need not only be at the right level of granularity, but also to provide the minimum context needed (for users) to understand the content of that element. In this sub-part we will discuss approaches based on traditional IR models (e.g. the vector space model, probabilistic models and language models) that have been adapted to the XML retrieval task. We will also discuss approaches that have been specifically developed for the XML retrieval task. The query language for content-oriented XML access, NEXI proposed at INEX will also be discussed and compared to XQuery Full-Text. We will finish with issues related to the evaluation of content-oriented XML retrieval, which required major re-design of the standard IR evaluation methodology, to deal with the nested nature of XML elements.

Part III: Open Research

The aim of this last part of the tutorial is to bridge together common goals and achievements in the DB and IR communities and summarize some of the common research challenges that would benefit from a better integration of DB and IR efforts. In particular, we will focus on full-text search in XML, approximate querying and ranking and the necessity for formalizing scoring methods and understanding efficiency and effectiveness issues in integrating queries on both structure and content. We will describe recent DB research on efficiently evaluating top-k queries in XML and recent IR research on relevance assessment granularity of retrieval for XML.

Intended Audience and Prerequisites:

We target researchers in DB and IR, software and application developers, and the XML community.

The audience is expected to know basic concepts on the relational data model, query processing, indexing and retrieval models, and IR evaluation methodology. Knowledge of XML is not required.

Biographies:

Sihem Amer-Yahia is a Senior Technical Specialist at AT\&T Labs Research. She received her Ph.D. degree from the University Paris XI-Orsay and INRIA. She has been working on various issues related to XML query processing. Sihem is a co-editor of the XQuery Full-Text language specification and use cases published in April 2005 by the W3C Full-Text Task Force. She is one of the main participant to the GalaTex project (www.galaxquery.org/galatex), a conformance implementation of XQuery Full-Text.

Mounia Lalmas is a reader in information retrieval at the Department of Computer Science, as Queen Mary University of London. Here research focuses on the development and evaluation of intelligent access to interactive heterogeneous and complex information repositories. She has applied uncertainty theories for the modeling of IR tasks in an expressive and unified manner and covering a range of domains such as Web, XML, and MPEG-7. She is the co-leader of the INEX initiative, with over 50 participating organizations worldwide.

Bibliography:

  1. S. Amer-Yahia, C. Botev, J. Shanmugasundaram. TeXQuery: A Full-Text Search Extension to XQuery. WWW 2004.
  2. R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
  3. J. Bosak. The plays of Shakespeare in XML. http://www.oasis-open.org/cover/bosakShakespeare200.html
  4. J. M. Bremer, M. Gertz. XQuery/IR: Integrating XML Document and Data Retrieval. WebDB 2002.
  5. E. W. Brown. Fast Evaluation of Structured Queries for Information Retrieval. SIGIR 1995.
  6. D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, A. Soffer. Searching XML Documents via XML Fragments. SIGIR 2003.
  7. T. T. Chinenyanga, N. Kushmerick. Expressive and Efficient Ranked Querying of XML Data. WebDB 2001.
  8. S. Cohen et al. XSEarch: A Semantic Search Engine for XML. VLDB 2003.
  9. W. B. Croft. Language Models for Information Retrieval. Invited Talk. ICDE 2003.
  10. DBLP in XML. http://dblp.uni-trier.de/xml/
  11. N. Fuhr, K. Grossjohann. XIRQL: An Extension of XQL for Information Retrieval. SIGIR 2000.
  12. N. Fuhr, M. Lalmas, S. Malik, and Z. Szlavik, editors. Advances in XML Information Retrieval, Third International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2004, volume 3493 of Lecture Notes in Computer Science. Springer, 2005.
  13. N. Fuhr, T. Rolleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM TOIS 15(1), 1997.
  14. L. Guo, F. Shao, C. Botev, J. Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents. SIGMOD 2003.
  15. Health Level Seven. http://www.hl7.org
  16. V. Hristidis, L. Gravano, Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. VLDB 2003.
  17. Initiative for the Evaluation of XML Retrieval. http://inex.is.informatik.uni-duisburg.de:2004/
  18. G. Kazai, M. Lalmas, and A. P. de Vries. The overlap problem in content-oriented xml retrieval evaluation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72--79, 2004.
  19. G. Kazai, M. Lalmas, N. Fuhr, and N. Govert. A report on the first year of the initiative for the evaluation of xml retrieval. JASIST, 55(6):551--556, 2004.
  20. J. Kamps, M. de Rijke, and B. Sigurbjornsson. Length normalization in xml retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80--87, 2004.
  21. Library of Congress. http://lcweb.loc.gov/crsinfo/xml/
  22. S. Liu, Q. Zou, and W. W. Chu. Configurable indexing and ranking for xml information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 88--95, 2004.
  23. MERIMEE. http://www.culture.gouv.fr/documentation/merimee/accueil.htm
  24. B. Piwowarski and M. Lalmas. Providing consistent and exhaustive relevance assessments for xml retrieval evaluation. In Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, pages 361--370, 2004.
  25. S. Robertson. The probability ranking principle in IR. Journal of Documentation 33, 1977.
  26. G. Salton, M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
  27. D. Shin, H. Jang, H. Jin. BUS: An Effective Indexing and Retrieval Scheme in Structured Documents. Proc. 3rd Int. Conf. on Dig. Lib., 1998.
  28. Sigmod Record in XML. http://www.acm.org/sigmod/record/xml/
  29. A. Trotman and B. Sigurbjornsson. Narrowed extended xpath i (nexi). In Advances in XML Information Retrieval, Third International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2004, pages 16--40, 2004.
  30. A. Theobald, G. Weikum. The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking. EDBT 2002.
  31. F. Weigel, H. Meuss, K. U. Schulz, F. Bry. Content and Structure in Indexing and Ranking XML. WebDB 2004.
  32. The World Wide Web Consortium. XQuery 1.0: An XML Query Language. W3C Working Draft. http://www.w3.org/TR/xquery/
  33. The World Wide Web Consortium. XML Path Language (XPath) 2.0. W3C Working Draft. http://www.w3.org/TR/xpath20/
  34. The World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Full-Text Use Cases. W3C Working Draft. http://www.w3.org/TR/xmlquery-full-text-use-cases/
  35. The World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Full-Text. W3C Working Draft. http://www.w3.org/TR/xquery-full-text/
  36. C. Zhang, J. Naughton, D. DeWitt, Q. Luo, G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. SIGMOD 2001.

top

Clustering Large and High Dimensional Data

Authors:

Jacob Kogan
and
Charles Nicholas
Department of Mathematics and Statistics
University of Maryland Baltimore County

Objective

To provide a concise review of some existing clustering techniques and present recent developments relevant to information and knowledge management. To identify challenging problems facing the development of future clustering algorithms capable of handling large and high dimensional sets of data.

Contents

Knowledge Management and Clustering

Large and often high dimensional data sets are now increasingly common and available. Clustering techniques are used to discover natural groups in data sets, and to identify abstract structures that might reside there, without having necessarily having any background knowledge of the characteristics of the data. Clustering has been used in a variety of areas, including: computer vision, VLSI design, data mining, bio--informatics (e.g. gene expression analysis), and information retrieval, to name just a few. Within IR, clustering techniques have been applied in text mining and web page clustering, among others. Indeed, clustering has been a subject of interest in IR for many years. The well--known ``cluster hypothesis", for example, which says that closely associated documents tend to be relevant to the same requests (van Rijsbergen 1979) suggests that document clustering should result in more effective, as well as more efficient, retrieval.

The tutorial provides an overview of clustering, and an introduction to some recently developed clustering techniques. We place particular attention on document clustering, and on the applications of modern nonlinear optimization tools.

Old and New Approaches to Clustering

Clustering is partitioning of a data set into groups of similar objects called clusters. Although there is considerable interest in the use and development of clustering methods in information technology research areas, clustering has a long and rich history in a number of fields, including biology, psychiatry, archeology, geology, geography and marketing.

The tutorial offers a brief overview of classical clustering techniques, and discusses recently developed clustering algorithms based on numerical linear algebra and nonlinear optimization tools. Those include, for example:

  • spectral clustering,
  • graph-based clustering algorithms,
  • SVD--based methods,
  • $k$-means like clustering algorithms,
  • Deterministic Annealing,
  • BIRCH,
to name just a few.

Often the quality of a partition is measured by an objective function associated with a proximity measure, and an optimal partition is one that optimizes the objective function. While finding the optimal solution to many clustering problems is known to be NP-hard, various estimates to optimal solutions and bounds for values of the associated objective functions are provided by spectral clustering methods (through relaxation), and nonlinear optimization techniques (through the dual optimization approach). A number of additional methods to assess clustering results, such as confusion matrix, classification accuracy, average purity, average entropy, mutual information and others will be presented.

Categorization of clustering algorithms is neither straightforward, nor canonical. Motivated by Text Mining and Information Retrieval applications, the tutorial aims to highlight advances in clustering large and high dimensional data sets. Vectors representing documents are typically very sparse, and it is the sparsity that permits the algorithms to be efficient.

We present clustering schemes, i.e., a sequence of clustering algorithms applied to the same data set, so that the output of algorithm $n$ becomes the input of algorithm $n+1$. As a particular example we consider SVD--based clustering algorithms followed by $k$-means like clustering algorithms. While the output of an SVD--based algorithm provides the initial partition for a $k$-means like algorithm, the $k$-means like algorithm further improves the output of the SVD--based algorithm. The algorithms complement and reinforce one another. Application of the clustering scheme improves the final partition. Computational complexity associated with applications clustering schemes will be discussed.

Similarity is fundamental to the definition of a cluster, and the introduction of a measure similarity (or proximity) between two patterns is essential for many clustering algorithms. While many of the classical clustering algorithms work with squared Euclidean distance, other distance-like functions may be more appropriate for text data. We shall present and discuss examples of different pattern proximity measures. In particular, we present recent clustering developments based on Bregman and Csiszar divergences.

Clustering and Optimization

The classical $k$-means clustering algorithm is probably one of the most popular techniques for data clustering. While partitioning a dataset into a prescribed number of clusters, the algorithm alternates between the computation of clusters and centroids of a dataset partition attempting to minimize an objective function defined on partitions of the dataset (i.e. on the set of subsets of the dataset). The roles of clusters and centroids are interchangeable, in the sense that when a partition (i.e. a set of clusters) is available one can compute the corresponding centroids. On the other hand, for a given set of centroids one can build the corresponding partition.

Rather than focus on partitions one can concentrate on centroids. In such a case the objective function to be minimized becomes a function of centroids. Hence, the clustering problem can in fact be formulated as a minimization of a non-smooth and non-convex function of a vector argument. A way to address this minimization problem is by introducing a family of parametrized smooth optimization problems that converge to the original non-smooth and non-convex objective function. We provide details of the ``smoothing" approach that can be applied to a variety of distance--like measures and, in particular, recovers the Deterministic Annealing (DA) for clustering. %which was introduced by Rose, Gurewitz and Fox.

Intended audience

Researchers and practitioners engaged in problems relevant to information and knowledge management, as well as data and knowledge bases. No prior knowledge of clustering is assumed.

Prerequisites

No prerequisites are required. A knowledge of basic linear algebra would be helpful.

Name and biography of the instructors

Jacob Kogan
is an Associate Professor in the Department of Mathematics and Statistics at the University of Maryland Baltimore County. Dr. Kogan received his Ph.D. in Mathematics from Weizmann Institute of Science, and has held teaching and research positions at the University of Toronto and Purdue University. His research interests include Text and Data Mining, Optimization, Calculus of Variations, Optimal Control Theory, and Robust Stability of Control Systems. Since 2001 he has also been affiliated with the Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County.

Charles Nicholas
is currently Professor and Chair at the Computer Science and Electrical Engineering Department, UMBC, where he has been since 1988. He received his Ph.D. from The Ohio State University in 1988. Dr. Nicholas' research interests include electronic document processing, information retrieval, and software engineering. Dr. Nicholas has served five times as the General Chair of the ACM Conference on Information and Knowledge Management (CIKM), most recently in 2002. He also twice chaired the Workshop on Digital Document Processing, PODP'96 and PODDP'98.

top

Matching Words and Pictures: Problems, Applications and Progress

Authors:

Latifur Khan, Ph.D.
Assistant Professor and Director of the UTD Database Laboratory (DBL@UTD)
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
Box 830688, EC 31
University of Texas at Dallas
Richardson, TX 75083-0688
Email: lkhan@utdallas.edu         Voice: (972) 883-4137
URL: http://www.utdallas.edu/~lkhan   Fax..: (972) 883-2349

Abstract:

The development of technology generates huge amounts of non-textual information, such as images. An efficient image annotation and retrieval system is highly desired. Clustering algorithms make it possible to represent visual features of images with finite symbols. Based on this, many statistical models, which analyze correspondence between visual features and words and discover hidden semantics, have been published. These models improve the annotation and retrieval of large image databases.

In this tutorial, first, we will present current state of the art and its shortcomings. We will present some classical models (e.g., translation model (TM), cross-media relevance model etc.). Second, we will present weighted feature selection algorithm as a solution to the existing problem. For a given cluster, we determine relevant features based on histogram analysis and assign greater weight to relevant features as compared to less relevant features. Third, we will exploit spatial correlation to disambiguate visual features, and spatial relationship will be constructed by spatial association rule mining. Fourth, we will present the continuous relevance model and multiple Bernoulli model for avoiding clustering. We will present mechanisms to link visual tokens with keywords based on these models. Fifth, we will present mechanisms to improve accuracy of classical model, TM by exploiting the WordNet knowledge-base. Sixth, we will present a framework to model semantic visual concept in video/images by fusing multiple evidence with the usage of an ontology. Seventh, we will show that weighted feature selection is better than traditional ones (TM) for automatic image annotation and retrieval. Finally, we will discuss open problems and future directions in the domain of image and video.

Aims/Learning objectives:

To provide automated annotation of images/pictures.

Scope:

Data Mining and Multimedia Information Management.

Keywords:

Multimedia Information Retrieval, Image, and Data Mining.

Target Audience:

Researchers and practitioners in the multimedia field with no previous or little knowledge about data mining and machine learning.

Prerequisites:

Attendants are expected to have the following background:

  • Knowledge of basic computer science principles and skills
  • Familiarity with basic probability theory

Justification:

In this tutorial our primary focus will be: how low level image features can be linked with keywords/high level concepts with the usage of clustering, classification, Expected maximization (EM), and singular value decomposition (SVD). Note that all these topics are related to data mining and machine learning areas. However, how the low level features will be extracted from images will not be covered in this tutorial. We assume that we will use current state of the art from computer vision research, (i.e., blob-world, normalized cut etc.).

Duration:

3 hours.

Enough material such as handouts and slides can be found at:
http://www.utdallas.edu/~lkhan/WWW2005Final.pdf

Biography:

Education

Ph.D. in Computer Science, University of Southern California (USC), Aug, 2000
Dissertation: Ontology-based Information Selection.
M.S. in Computer Science, University of Southern California, Dec, 1996.
B.S. in Computer Science and Engineering, Bangladesh University of Engineering and Technology, Nov, 1993.

Work Experience

Assistant Professor: Computer Science Dept., University of Texas at Dallas, since Sept, 2000.
Graduate Research Assistant: Integrated Media Systems Center, USC, Sept, 1997--Aug, 2000, and Information Sciences Institute (ISI), USC, Sept, 1995-Aug, 1997.
Lecturer: Computer Science & Engineering, Bangladesh University of Engineering and Technology, Dec, 1993 -Aug, 1995.

Service to the Profession

  • Half Day Tutorial, "Matching Words and Pictures - Problems, Applications, and Progress", in 14th ACM International World Wide Web Conference, WWW2005, May 2005, Chiba, Japan, and in MIS 2005 International Workshop on Multimedia Information Systems, Sorrento, Italy.
  • Program Committee Member of 11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Chicago, August, 2005.
  • Program Committee Member of 17tth IEEE International Conference on Tools with Artificial Intelligence (ICTAI), November, 2005, Hong Kong.
  • Program Committee Member of the 13th International Conference on Cooperative Information Systems (CoopIS 2005), Agia, Cyprus, Oct, 2005.
  • Program Committee Member of International Conference on Database and Expert Systems Applications DEXA 2005, September, 2005, Copenhagen, Denmark.
  • Program Committee Member of IEEE 5th Symposium on Bioinformatics and Bioengineering, Minnesota, 2005.
  • Program Chair of ACM 6th International Workshop on Multimedia Data Mining (MDM/KDD2005), August 2005, Chicago, Illinois.
  • Program Co-Chair in First International Workshop on Geographic and Biological Data Management (GBDM04) in conjunction with COMPSAC 2004, September 28-30, 2004, Hong Kong.
  • Guest Editor, International Journal of Knowledge and Information Systems (KAIS) by Springer-Verlag, a special issue on “Multimedia Data Mining.”
  • Program Committee Member of IEEE International Conference on Multimedia & Expo (ICME) July, 2003, Maryland.
  • Invited Reviewer of IEEE Transaction on Knowledge and Data Engineering (TKDE), ACM Transaction on Information Systems (TOIS), International Journal of Cooperative Information Systems (IJCIS), International Journal of Knowledge and Information Systems (KAIS) by Springer-Verlag.
  • NSF Panelist at IIS division in 2003 and 2004.
Awards
  • A Distributed Component Repository for Rapid Synthesis of Adaptive Real-Time Systems, National Science Foundation, Sept, 2001 -- Aug, 2004, NGS 0103709.
  • Developing Advanced Middleware for Convergence of IT and Telecommunications, Alcatel USA, Jan. 15, 2004, to Jan. 15, 2005.
Research Interests

Data Mining, Multimedia Information Management, Database Systems, Bio Informatics, Intrusion Detection, Information Retrieval, Web-based Information Integration and management.

Selected Publications
  1. L. Khan, M. Awad and B. Thuraisingham, "A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering," to appear The VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing.
  2. F. Luo, L. Khan , F. Bastani, I-Ling Yen and J. Zhou, “A Dynamical Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering Gene Expression Profiles,” Bioinformatics Journal, 20 (16): 2605-2617 (2004), Oxford University Press, UK.
  3. L. Khan, F. Luo, and K. Tang, “Hierarchical Clustering of Complex Data,” to appear in a special issue of the International Journal on Artificial Intelligence Tools, published by World Scientific.
  4. L. Khan, D. McLeod, and E. Hovy, “Retrieval Effectiveness of Ontology-based Model for Information Selection,” The VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing, Vol. 13(1): 71-85 (2004).
  5. L. Khan and D. McLeod, “Audio Structuring and Personalized Retrieval Using Ontologies,” in Proc. of ACM/IEEE Advances in Digital Libraries, Library of Congress, Washington, DC, pp. 116-126, May 2000.
  6. L. Khan, “Structuring and Querying Personalized Audio using Ontologies,” in Proc. of ACM Multimedia, Volume 2, Orlando, FL, pp. 209-210, Nov, 1999.
  7. L. Khan, D. McLeod, E. Hovy, “A Framework for Effective Annotation of Information from Closed Captions Using Ontologies” to appear in Journal of Intelligent Information Systems, Kluwer Academic Publishers.
  8. C. Shahabi, L. Khan, and D. McLeod, “A Probe-Based Technique to Optimize Join Queries in Distributed Internet Databases,” International Journal of Knowledge and Information Systems (KAIS) by Springer-Verlag, vol. 2 no. 3, pp. 372-385, 2000.
  9. L. Wang, L. Liu, and L. Khan, “Automatic Image Annotation and Retrieval using Weighted Feature Selection” to appear in international Journal Multimedia Tools and Applications, Kulwer Publisher.
  10. L. Khan, D. McLeod, and C. Shahabi, “An Adaptive Probe-Based Technique to Optimize Join Queries in Distributed Internet Databases,” International Journal of Database Management, 12(4): 3-14 (2001).
  11. E. Bertino1, L. Khan, R. Sandhu and B. Thuraisingham, “Secure Knowledge Management,” to appear in IEEE Transactions on Systems Man and Cybernetics, Part A.
Advisor:

Dennis McLeod, Computer Science, University of Southern California.

Advisee:

Ph.D.: M. Awad, M. Richard, L. Wang, M. Alam (co-advised), Yuhan Chen (co-advised), Ryan Layfiield (co-advised), Li Liu (co-advised), N. Tsybulnik (co-advised).

Graduated:

F. Luo (Ph.D. –summer 2004, initial employment at OAK Ridge National Laboratory), A. Bashir (M.S. thesis), Manish Gupta (M.S. thesis), M. Mirza (M.S. thesis), C. Breen (M.S.), R. Bhairampally (M.S.), and Y. Rao (M.S.).

top

Support Vector Machines for Information Management

Authors:

Dr. James G. Shanahan
Chief Scientist
Turn Inc.
1400 Fashion Island Blvd. Suite 510
San Mateo, CA 94404, U.S.A.
Tel: 650-522-4308

Abstract:

Support vector machines (SVM) are a general purpose suite of machine learning algorithms for classification and regression that were introduced by Vapnik and his colleagues in 1992. Generic support vector machines (SVMs) provide excellent performance on a variety of learning problems including hand-written character recognition, face detection and, most recently, text categorization.

Practical applications of SVMs to text can be found in areas such as knowledge management, process improvement, CRM (Customer Relationship Management), text mining, alerting, intelligence and law enforcement, spam and porn filtering, and bioinformatics. Most fall under the generic umbrellas of text classification and adaptive filtering.

Research on customizing SVMs for text processing (ranging from classification to clustering) has exploded in the past five years. This has resulted in a plethora of new approaches and many interesting and commercially viable applications. However, the concrete steps required to reduce SVM theory to practice are often not obvious or clearly explained in the research literature.

Contents:

This tutorial provides a self-contained and systematic exposition of the following key concepts in SVMs:

  • Support Vector Machines
    • Linear Separators
    • Primal SVMs
    • Dualspace SVMs
    • Hard and Soft SVMs
  • Kernels
    • Linear, Polynomial, RBF
    • String Kernels
    • Latent Semantic Kernels
  • SVM Learning Algorithms
    • Perceptron
    • Kernel Adatron
    • SMO learning algorithm (and improvements)
    • Quadratic Programming
This tutorial will show how generic SVMs can be customized for text processing tasks such as classification and filtering. Topics covered here will include:
  • Uneven Margin Based Learning;
  • Thresholding SVMs;
  • Transductive SVMs; and
  • Shrinkage.
SVM-based solutions to actual text-processing problems will be described and compared to other more traditional approaches.

Discussion of techniques will be supported by live demonstrations.

After attending this tutorial, you should:

  • Understand the fundamentals and the important ideas behind SVMs and kernels, with the help of illustrative examples in the domain of text classification and filtering.
  • Have an in-depth understanding of how to implement an SVM learning algorithm or use a publicly available SVM package for the tasks of text classification and filtering.
  • Have technical insight into technologies that you may see in commercial applications such as knowledge management, process improvement, CRM (Customer Relationship Management), text mining, alerting, intelligence and law enforcement, spam and porn filtering, and bioinformatics.
  • Understand the intellectual property surrounding SVM technology.
  • Be familiar with the exciting and promising areas of research in this domain.

TARGET AUDIENCE

This tutorial is especially targeted at people who are interested in implementing support vector machines for text processing tasks; people who are responsible for understanding implications of using such systems; people who want to understand and extend the core concepts; and people who want to understand the intellectual property surrounding the core algorithms.

Biography

James G. Shanahan is chief scientist at Turn Inc, where his primary focus is on developing and exploiting advanced techniques in information science to build a next generation online advertising network. Prior to joining Turn, he was principal research scientist at Clairvoyance Corporation where he led the Knowledge Discovery from Text Group. While at Clairvoyance Corporation he was actively involved in developing text and process mining systems. Previously he was research scientist at Xerox Research Center Europe (XRCE), Grenoble, where, as a member of the Co-ordination Technologies Group, he developed Document Souls, a patented document-centric approach to information access..

He has published five books in the area of machine learning and information processing including a book on affect and opinion mining. In addition, he has authored over 50 research publications and is an inventor on numerous patents, five of which have been granted. He has given numerous tutorials at the Search Engines Conference, and at the ACM CIKM conference. He has been a member of the organizing and program committees in numerous international conferences (e.g., ACM SIGIR, ACM CIKM, IEEE FUZZ-IEEE) and workshops and is an active journal reviewer. He was co-organizer of the AAAI Spring Symposium, EAAT, on Affect and Opinion Modeling (Stanford, 2004) and a style in text workshop at SIGIR 2005.

top

Probabilistic and Causal Modeling with Graphical Models in Information and Knowledge Management

Ricardo Silva
CALD - Center for Automated Learning and Discovery
Carnegie Mellon University

Abstract:

Probability and causality are the foundations of many decision support systems. Building and using probabilistic and causal models, however, can be a complex task. One way of approaching this problem is through graphical models, where dependencies between random variables are encoded in the language of graphs. Graphical representations are powerful tools for introducing prior knowledge, estimating models from data, assessing plausibility of estimated models, and to perform approximate inference where standard probabilistic models are computationally intractable. This is possible because in many real-world domains there is a natural structure that allows one to decompose a problem into independent substructures.

Goal of Tutorial:

In this tutorial, we give an introduction on how to use graphical models. The goal is to provide an overview of the fundamental issues, from modeling background knowledge to statistical and computational issues that arise when applying graphical models. The focus will be more on the ideas underlying popular approaches rather than technical details. The tutorial will include software demonstrations and examples in applications such as information retrieval and clustering.

OUTLINE
  1. INTRODUCTION TO GRAPHICAL MODELS
    1. Concepts of probability and causality
    2. The advantages of graphical models
    3. Knowledge representation through graphical models
      1. Directed acyclic graphs
      2. Overview of other types of graphs
  2. INFERENCE: HOW TO MAKE PREDICTIONS USING GRAPHICAL MODELS
    1. The hardness of computing probabilities
    2. Exact inference
    3. Approximate inference
    4. Causal predictions
  3. LEARNING: ESTIMATING MODELS FROM DATA
    1. Estimating parameters from data
    2. Estimating structure from data
    3. Learning causal models
    4. Bayesian learning
    5. Learning with hidden variables
  4. 4. EXAMPLES OF APPLICATIONS
    1. Classification and density estimation
    2. Information retrieval
    3. Policy making
Audience:

Researchers and practitioners with some background in probability and statistics but little background on graphical models that want a roadmap of the main topics on the area.

Bibliography:

Ricardo Silva concluded this year his PhD in Computational and Statistical Learning from the Center for Automated Learning and Discovery, part of the School of Computer Science in Carnegie Mellon University. He worked on discovering causal models with hidden variables from observational data along with Richard Scheines, Clark Glymour and Peter Spirtes, during which he won important scholarships such as a Microsoft Fellowship and a Siebel scholarship. Currently, he is a post-doc at the Gatsby Computational Neuroscience Unit, University College London.

top

High-Performance Indexing and Query Evaluation for Information Retrieval

Justin Zobel
School of Computer Science and Information Technology
RMIT University, Melbourne, Australia

Extended Abstract:

Text search is a key technology of modern computing. Search engines that index the web provide a breadth and ease of access to information that was inconceivable a decade ago. Text search has also grown in importance at the other end of the size spectrum; the help services built into operating systems, for example, rely on it. Search algorithms have become an element of many different kinds of software.

Basic information retrieval techniques, developed and refined over more than forty years, are well-known. Since 1990, a range of new query evaluation algorithms and index representations have been appeared, including those developed by the presenter of this tutorial, that allow information retrieval queries to be efficiently resolved on document collections containing terabytes of text. The challenges presented by text search have led to development of a wide range of algorithms and data structures. These include compact representations for text indexes, innovative index construction techniques, and efficient algorithms for evaluation of text queries. Systems with indexes based on these techniques can resolve queries with a small fraction of the resources required by traditional approaches to text indexing, and they allow the rapid response expected of web search engines.

The proposed tutorial concerns the practical problems of indexing, querying, storing, and updating large text databases, including those that are the result of web-based information harvesting. While some of these developments are well known and have been consolidated in textbooks, many specific innovations are not widely known, and the textbook descriptions are incomplete. This tutorial introduces the key techniques in the area, describing both a core implementation and how that core can be enhanced through a range of extensions and innovations. The main elements of these innovations include fast index construction techniques, novel index representations, and efficient query evaluation strategies.

Objectives:

Attendees at the tutorial will learn how search engines are implemented and should be able to undertake development of simple algorithms and data structures for text search. The tutorial consolidates recent innovations in search engine implementation, and provides attendees with a basis for further research in the area.

Many of the techniques reviewed are used in the public-domain search engine Zettair, developed at RMIT under the direction of the presenter. Attendees can use the tutorial to learn about Zettair internals and thus how to adapt it for use their own research.

Intended Audience:

The tutorial is designed to cater for the needs of several audiences:

  • Practitioners who wish to become familiar with recent advances in the area of information retrieval implementation; or who wish to to learn about novel techniques for efficient resource usage in information retrieval systems, including techniques for index compression, index construction, and high-performance query evaluation;
  • Established researchers who wish to understand the various implementation issues that must be addressed in practical information retrieval systems; and
  • Students who wish to establish a basis for research in the broad area of information retrieval system implementation.

Prerequisites:

None. Basic concepts of search and retrieval are introduced in the tutorial.

Biographical Information:

Justin Zobel completed his Ph.D. at the University of Melbourne, then joined the academic staff of the School of Computer Science and Information Technology at RMIT University. He is now a Professor and leader of the RMIT Search Engine research group, which recently released the public-domain Zettair search engine. Zobel has published widely in the areas of information retrieval, database systems, text databases, data structures, genomics, and compression, is the author of the text "Writing for Computer Science", and is a coauthor of the texts "Indexing Techniques for Advanced Database Systems" and "Document Computing: Technologies for Managing Electronic Document Collections". He is active in the information retrieval community and has served as Chair and Program Committee member for a range of conferences, including ACM SIGIR. He is currently Treasurer of SIGIR, an editor-in-chief of the Information Retrieval journal, and an associate editor of ACM Transactions on Information Systems and of Information Processing & Management.

The tutorial is co-written by Alistair Moffat, a Professor in the Department of Computer Science at the University of Melbourne. Moffat has published more than 120 refereed papers in the areas of sorting and searching algorithms; text, image, and index compression; and the implementation of information retrieval systems. He is a coauthor of the books "Managing Gigabytes: Compressing and Indexing Documents and Images" and "Compression and Coding Algorithms".

Both Moffat and Zobel have many years experience with implementation of retrieval systems. In the early 1990s, they developed MG, a research prototype text retrieval system that was widely used in contexts such as the international TREC information retrieval project. Both Moffat and Zobel have ongoing research projects in the areas examined in this tutorial.

top