Accessing XML Content: From DB and IR Perspectives
Sihem Amer-Yahia
AT & T Labs Research
Florham Park, NJ 07932, USA
Mounia Lalmas
Department of Computer Science
Queen Mary University of London
For more information click the title of the tutorial
Clustering Large and High Dimensional Data
Jacob Kogan
and
Charles Nicholas
Department of Mathematics and Statistics
University of Maryland Baltimore County
For more information click the title of the tutorial
Matching Words and Pictures: Problems, Applications and Progress
Latifur Khan
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas
For more information click the title of the tutorial
Support Vector Machines for Information Management
James G. Shanahan
Turn Inc.
San Mateo, CA 94404, U.S.A.
For more information click the title of the tutorial
Probabilistic and Causal Modeling with Graphical Models in Information and Knowledge Management
Ricardo Silva
CALD - Center for Automated Learning and Discovery
Carnegie Mellon University
For more information click the title of the tutorial
High-Performance Indexing and Query Evaluation for Information Retrieval
Justin Zobel
School of Computer Science and Information Technology
RMIT University, Melbourne, Australia
For more information click the title of the tutorial
Accessing XML Content: From DB and IR Perspectives
Authors:
Sihem Amer-Yahia
AT & T Labs Research
Florham Park, NJ 07932, USA
sihem@research.att.com
Tel: +1 973 360 8957
Fax: +1 973 360 8871
Mounia Lalmas
Department of Computer Science
Queen Mary University of London
London, England, E1 4NS
mounia@dcs.qmul.ac.uk
Tel: +44 (0)20 7882 5200
Fax: +44 (0)20 8980 6533
Goal of Tutorial:
Documents today contain a mixture of structured and unstructured
content. One way to format this mixed content is according to the
adopted W3C (World Wide Web Consortium) standard for information
repositories and exchanges, the so-called eXtensible Markup Language
(XML)-http://www.w3.org/XML/.
An ever growing number of XML content is being made available. For
example, the IEEE INEX (the INitiative for the Evaluation of XML)
collection contains IEEE papers in XML form, including structured
information such as the names of authors and the date of publication,
and also unstructured information such as the text content of the
paper. Other examples are Shakespeare's plays in XML, DBLP in XML,
SIGMOD Record in XML, the United States Library of Congress documents
in XML, medical data in XML and the MERIMEE collection of building
descriptions in France. Furthermore, application domains such as the
Library Science have growing needs to seamlessly query over both
structured and unstructured content. As a result, a lot of approaches
have been developed in the past few years to store and query such
mixed content.
The development of approaches to access XML content has generated a
wealth of issues that are being addressed by both the database (DB)
and information retrieval (IR) communities. The DB community has
traditionally focused on developing query languages and efficient
evaluation algorithms for highly structured content. In contrast, the
IR community has focused on searching unstructured content, and has
developed various techniques for ranking query results and evaluating
their effectiveness. Recent trends in DB and IR research demonstrate
a growing interest in adopting IR techniques in DBs and vice versa for
accessing XML content.
In this tutorial, we propose to survey the main challenges that arise
from accessing XML content from both the DB and IR perspectives. In
particular, we will describe the W3C and INEX efforts in both
communities and the challenges behind storing XML documents, designing
query languages, appropriate scoring and ranking methods,
implementation architectures and query evaluation algorithms and,
summarize open research issues. We believe that this tutorial is
necessary to drive the attention of DB and IR researchers and
practitioners to participate together in solving these issues.
In the rest of the proposal, we provide an overview of the
organization of this tutorial, the intended audience and the authors'
biographies. A selected DB and IR bibliography is also included at the
end of the proposal.
This tutorial, in this form, has not presented at any other conference
before. However, Sihem Amer-Yahia gave a lecture on full-text search
in XML at the 7th Extending DataBase Technology summer school on XML
and databases (EDBT'05) in Sardinia, Italy in September 2004. She also
organized a panel at the last Special Interest Group for the
Management of Data (SIGMOD'05) conference in Baltimore USA in June
2005 on the integration of DB and IR technologies and is presenting a
tutorial on XML Full-Text Search at the Very Large DataBases (VLDB'05)
conference in Trondheim, Norway in August 2005. In addition, Mounia
Lalmas will be giving a lecture on structure/XML retrieval at the 5th
European Summer School in Information Retrieval (ESSIR'05) in Dublin
in September 2005, where the IR approach to accessing XML content will
be presented. She has also given two keynote speeches on IR approaches
to XML retrieval and their evaluation, at the 5th Dutch-Belgian
Information Retrieval Workshop (DIR'05) in Utrecht, The Netherlands,
in January 2005, and the Premiere COnference en Recherche
d'Information et Applications (CORIA'04), in Toulouse, France in March
2004.
Content of Tutorial:
The tutorial is organized in 3 parts. The first part presents XML
documents and queries, the W3C and INEX efforts and their scope, and
describes the challenges behind storing and accessing XML content. The
second part of the tutorial presents the DB (often referred to as the
data-centric) and IR (often referred to as the document-centric) views
of XML. The DB view includes storing XML documents in relational
databases, query language such as XPath, XQuery and XQuery Full-Text
as well as query evaluation issues. The IR view includes retrieval
methods for ranking XML elements and assessing the effectiveness of
retrieval results. The last part of the tutorial concludes with the
main challenges ahead in both communities.
Part I: Introduction and Motivation
The first part will provide a brief introduction of what XML is and
its goals. We will then describe the two efforts, one in DB research
and one in IR research, as they can be viewed as having set the scene
for research in XML content access, from motivation, challenge,
development, implementation and validation. In the DB area, we will
discuss the W3C effort in designing the XQuery 1.0 and
XPath 2.0 languages that provide powerful primitives to
navigate the structure of XML documents. Many database researchers and
practitioners have influenced the design of these languages, which
therefore impact on DB approaches to access XML content. On the other
hand, in the IR area, INEX was set up at the beginning of 2002 with
the aim to establish infrastructures, XML test-suites and appropriate
measurements for evaluating the performance (efficiency and
effectiveness) of XML information access approaches. INEX has proposed
its own query language, NEXI, an enhanced subset of XPath that deal
specifically with content-oriented aspects.
Part II: Accessing XML Content
The DB View:
The DB community has a data-centric view and focused on issues related
to storing XML documents in relational databases. Applications, such
as electronic markets, that produce and consume large volumes of data
require efficient and reliable storage and retrieval of XML data.
Many techniques for XML storage have been proposed, including flat
files, relational database management systems, object-oriented
database systems, LDAP directories, and native XML database
systems. We will review the main solutions proposed for XML
storage. We will then describe XQuery, a SQL-like W3C language for
querying XML and XPath, its core navigation language. We will also
describe XQuery Full-Text, an extension to XPath/XQuery to support XML
full-text search, that is being developed within the W3C. Finally, we
will focus on efficient evaluation techniques for XML that were
proposed in the DB community.
The IR View:
The IR view is concerned with the content-oriented retrieval of XML
documents, and aims at supporting more precise access to XML documents
by retrieving XML document components (the so-called XML elements)
instead of whole documents in response to users' queries. Therefore,
in principle, XML elements of any granularity (for example a paragraph
or the section enclosing it) are potential answers to a query, as long
as they are relevant. This constitutes a major departure from
traditional IR and introduces new challenges. A first challenge is
that XML retrieval systems need not only find relevant information in
the XML documents, but also determine the appropriate level of
component granularity to return to the user. A second challenge is
that the retrieved XML elements need not only be at the right level of
granularity, but also to provide the minimum context needed (for
users) to understand the content of that element. In this sub-part we
will discuss approaches based on traditional IR models (e.g. the
vector space model, probabilistic models and language models) that
have been adapted to the XML retrieval task. We will also discuss
approaches that have been specifically developed for the XML retrieval
task. The query language for content-oriented XML access, NEXI
proposed at INEX will also be discussed and compared to XQuery
Full-Text. We will finish with issues related to the evaluation of
content-oriented XML retrieval, which required major re-design of the
standard IR evaluation methodology, to deal with the nested nature of
XML elements.
Part III: Open Research
The aim of this last part of the tutorial is to bridge together common
goals and achievements in the DB and IR communities and summarize some
of the common research challenges that would benefit from a better
integration of DB and IR efforts. In particular, we will focus on
full-text search in XML, approximate querying and ranking and the
necessity for formalizing scoring methods and understanding efficiency
and effectiveness issues in integrating queries on both structure and
content. We will describe recent DB research on efficiently evaluating
top-k queries in XML and recent IR research on relevance assessment
granularity of retrieval for XML.
Intended Audience and Prerequisites:
We target researchers in DB and IR, software and application
developers, and the XML community.
The audience is expected to know basic concepts on the relational data
model, query processing, indexing and retrieval models, and IR
evaluation methodology. Knowledge of XML is not required.
Biographies:
Sihem Amer-Yahia is a Senior Technical Specialist at AT\&T Labs
Research. She received her Ph.D. degree from the University Paris
XI-Orsay and INRIA. She has been working on various issues related to
XML query processing. Sihem is a co-editor of the XQuery Full-Text
language specification and use cases published in April 2005 by the
W3C Full-Text Task Force. She is one of the main participant to the
GalaTex project (www.galaxquery.org/galatex), a conformance
implementation of XQuery Full-Text.
Mounia Lalmas is a reader in information retrieval at the Department
of Computer Science, as Queen Mary University of London. Here research
focuses on the development and evaluation of intelligent access to
interactive heterogeneous and complex information repositories. She
has applied uncertainty theories for the modeling of IR tasks in an
expressive and unified manner and covering a range of domains such as
Web, XML, and MPEG-7. She is the co-leader of the INEX initiative,
with over 50 participating organizations worldwide.
Bibliography:
-
S. Amer-Yahia, C. Botev, J. Shanmugasundaram.
TeXQuery: A Full-Text Search Extension to XQuery.
WWW 2004.
-
R. Baeza-Yates, B. Ribeiro-Neto.
Modern Information Retrieval.
Addison-Wesley, 1999.
-
J. Bosak.
The plays of Shakespeare in XML.
http://www.oasis-open.org/cover/bosakShakespeare200.html
-
J. M. Bremer, M. Gertz.
XQuery/IR: Integrating XML Document and Data Retrieval.
WebDB 2002.
-
E. W. Brown.
Fast Evaluation of Structured Queries for Information Retrieval.
SIGIR 1995.
-
D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, A. Soffer.
Searching XML Documents via XML Fragments.
SIGIR 2003.
-
T. T. Chinenyanga, N. Kushmerick.
Expressive and Efficient Ranked Querying of XML Data.
WebDB 2001.
-
S. Cohen et al.
XSEarch: A Semantic Search Engine for XML.
VLDB 2003.
-
W. B. Croft.
Language Models for Information Retrieval.
Invited Talk.
ICDE 2003.
-
DBLP in XML.
http://dblp.uni-trier.de/xml/
-
N. Fuhr, K. Grossjohann.
XIRQL: An Extension of XQL for Information Retrieval.
SIGIR 2000.
-
N. Fuhr, M. Lalmas, S. Malik, and Z. Szlavik, editors.
Advances in XML Information Retrieval, Third International
Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2004,
volume 3493 of Lecture Notes in Computer Science. Springer, 2005.
-
N. Fuhr, T. Rolleke.
A Probabilistic Relational Algebra for the Integration of
Information Retrieval and Database Systems.
ACM TOIS 15(1), 1997.
-
L. Guo, F. Shao, C. Botev, J. Shanmugasundaram.
XRANK: Ranked Keyword Search over XML Documents.
SIGMOD 2003.
-
Health Level Seven.
http://www.hl7.org
-
V. Hristidis, L. Gravano, Y. Papakonstantinou.
Efficient IR-Style Keyword Search over Relational Databases.
VLDB 2003.
-
Initiative for the Evaluation of XML Retrieval.
http://inex.is.informatik.uni-duisburg.de:2004/
-
G. Kazai, M. Lalmas, and A. P. de Vries.
The overlap problem in content-oriented xml retrieval evaluation.
In Proceedings of the 27th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages
72--79, 2004.
-
G. Kazai, M. Lalmas, N. Fuhr, and N. Govert.
A report on the first year of the initiative for the evaluation of
xml retrieval.
JASIST, 55(6):551--556, 2004.
-
J. Kamps, M. de Rijke, and B. Sigurbjornsson.
Length normalization in xml retrieval.
In Proceedings of the 27th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages
80--87, 2004.
-
Library of Congress.
http://lcweb.loc.gov/crsinfo/xml/
-
S. Liu, Q. Zou, and W. W. Chu.
Configurable indexing and ranking for xml information retrieval.
In Proceedings of the 27th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages
88--95, 2004.
-
MERIMEE.
http://www.culture.gouv.fr/documentation/merimee/accueil.htm
-
B. Piwowarski and M. Lalmas.
Providing consistent and exhaustive relevance assessments for xml
retrieval evaluation.
In Proceedings of the 2004 ACM CIKM International Conference on
Information and Knowledge Management, pages 361--370, 2004.
-
S. Robertson.
The probability ranking principle in IR.
Journal of Documentation 33, 1977.
-
G. Salton, M. J. McGill.
Introduction to Modern Information Retrieval.
McGraw-Hill, 1983.
-
D. Shin, H. Jang, H. Jin.
BUS: An Effective Indexing and Retrieval Scheme in Structured Documents.
Proc. 3rd Int. Conf. on Dig. Lib., 1998.
-
Sigmod Record in XML.
http://www.acm.org/sigmod/record/xml/
-
A. Trotman and B. Sigurbjornsson.
Narrowed extended xpath i (nexi).
In Advances in XML Information Retrieval, Third International
Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2004,
pages 16--40, 2004.
-
A. Theobald, G. Weikum.
The Index-Based XXL Search Engine for Querying XML Data with
Relevance Ranking.
EDBT 2002.
-
F. Weigel, H. Meuss, K. U. Schulz, F. Bry.
Content and Structure in Indexing and Ranking XML.
WebDB 2004.
-
The World Wide Web Consortium.
XQuery 1.0: An XML Query Language.
W3C Working Draft. http://www.w3.org/TR/xquery/
-
The World Wide Web Consortium.
XML Path Language (XPath) 2.0.
W3C Working Draft. http://www.w3.org/TR/xpath20/
-
The World Wide Web Consortium.
XQuery 1.0 and XPath 2.0 Full-Text Use Cases.
W3C Working Draft. http://www.w3.org/TR/xmlquery-full-text-use-cases/
-
The World Wide Web Consortium.
XQuery 1.0 and XPath 2.0 Full-Text.
W3C Working Draft. http://www.w3.org/TR/xquery-full-text/
-
C. Zhang, J. Naughton, D. DeWitt, Q. Luo, G. Lohman.
On Supporting Containment Queries in Relational Database
Management Systems.
SIGMOD 2001.
top
Clustering Large and High Dimensional Data
Authors:
Jacob Kogan
and
Charles Nicholas
Department of Mathematics and Statistics
University of Maryland Baltimore County
Objective
To provide a concise review of some existing clustering techniques and
present recent developments relevant to information and knowledge
management. To identify challenging problems facing the development
of future clustering algorithms capable of handling large and high
dimensional sets of data.
Contents
Knowledge Management and Clustering
Large and often high dimensional data sets are now increasingly common
and available. Clustering techniques are used to discover natural
groups in data sets, and to identify abstract structures that might
reside there, without having necessarily having any background
knowledge of the characteristics of the data. Clustering has been used
in a variety of areas, including: computer vision, VLSI design, data
mining, bio--informatics (e.g. gene expression analysis), and
information retrieval, to name just a few. Within IR, clustering
techniques have been applied in text mining and web page clustering,
among others. Indeed, clustering has been a subject of interest in IR
for many years. The well--known ``cluster hypothesis", for example,
which says that closely associated documents tend to be relevant to
the same requests (van Rijsbergen 1979) suggests that document
clustering should result in more effective, as well as more efficient,
retrieval.
The tutorial provides an overview of clustering, and an introduction
to some recently developed clustering techniques. We place particular
attention on document clustering, and on the applications of modern
nonlinear optimization tools.
Old and New Approaches to Clustering
Clustering is partitioning of a data set into groups of similar
objects called clusters. Although there is considerable interest in
the use and development of clustering methods in information
technology research areas, clustering has a long and rich history in a
number of fields, including biology, psychiatry, archeology, geology,
geography and marketing.
The tutorial offers a brief overview of classical clustering
techniques, and discusses recently developed clustering algorithms
based on numerical linear algebra and nonlinear optimization tools.
Those include, for example:
-
spectral clustering,
-
graph-based clustering algorithms,
-
SVD--based methods,
-
$k$-means like clustering algorithms,
-
Deterministic Annealing,
-
BIRCH,
to name just a few.
Often the quality of a partition is measured by an objective function
associated with a proximity measure, and an optimal partition is one
that optimizes the objective function. While finding the optimal
solution to many clustering problems is known to be NP-hard, various
estimates to optimal solutions and bounds for values of the associated
objective functions are provided by spectral clustering methods
(through relaxation), and nonlinear optimization techniques (through
the dual optimization approach).
A number of additional methods to assess clustering results, such as
confusion matrix, classification accuracy, average purity, average
entropy, mutual information and others will be presented.
Categorization of clustering algorithms is neither straightforward,
nor canonical. Motivated by Text Mining and Information Retrieval
applications, the tutorial aims to highlight advances in clustering
large and high dimensional data sets. Vectors representing documents
are typically very sparse, and it is the sparsity that permits the
algorithms to be efficient.
We present clustering schemes, i.e., a sequence of clustering
algorithms applied to the same data set, so that the output of
algorithm $n$ becomes the input of algorithm $n+1$. As a particular
example we consider SVD--based clustering algorithms followed by
$k$-means like clustering algorithms. While the output of an
SVD--based algorithm provides the initial partition for a $k$-means
like algorithm, the $k$-means like algorithm further improves the
output of the SVD--based algorithm. The algorithms complement and
reinforce one another. Application of the clustering scheme
improves the final partition. Computational complexity associated
with applications clustering schemes will be discussed.
Similarity is fundamental to the definition of a cluster, and the
introduction of a measure similarity (or proximity) between two
patterns is essential for many clustering algorithms. While many of
the classical clustering algorithms work with squared Euclidean
distance, other distance-like functions may be more appropriate
for text data. We shall present and discuss examples of different
pattern proximity measures. In particular, we present recent
clustering developments based on Bregman and Csiszar divergences.
Clustering and Optimization
The classical $k$-means clustering algorithm is probably one of the
most popular techniques for data clustering. While partitioning a
dataset into a prescribed number of clusters, the algorithm alternates
between the computation of clusters and centroids of a dataset
partition attempting to minimize an objective function defined on
partitions of the dataset (i.e. on the set of subsets of the dataset).
The roles of clusters and centroids are interchangeable, in the sense
that when a partition (i.e. a set of clusters) is available one can
compute the corresponding centroids. On the other hand, for a given
set of centroids one can build the corresponding partition.
Rather than focus on partitions one can concentrate on centroids. In
such a case the objective function to be minimized becomes a function
of centroids. Hence, the clustering problem can in fact be formulated
as a minimization of a non-smooth and non-convex function of a vector
argument. A way to address this minimization problem is by
introducing a family of parametrized smooth optimization problems that
converge to the original non-smooth and non-convex objective function.
We provide details of the ``smoothing" approach that can be applied to
a variety of distance--like measures and, in particular, recovers the
Deterministic Annealing (DA) for clustering.
%which was introduced by Rose, Gurewitz and Fox.
Intended audience
Researchers and practitioners engaged in problems relevant to information
and knowledge management, as well as data and knowledge bases. No
prior knowledge of clustering is assumed.
Prerequisites
No prerequisites are required. A knowledge of basic linear algebra
would be helpful.
Name and biography of the instructors
Jacob Kogan
is an Associate Professor in the Department of Mathematics
and Statistics at the University of Maryland Baltimore County.
Dr. Kogan received his Ph.D. in Mathematics from Weizmann Institute of Science,
and has held teaching and research positions at the University of
Toronto and Purdue University. His research interests include Text and
Data Mining, Optimization, Calculus of Variations, Optimal Control Theory,
and Robust Stability of Control Systems. Since 2001 he has also been
affiliated with the Department of Computer Science and Electrical
Engineering, University of Maryland Baltimore County.
Charles Nicholas
is currently Professor and Chair at the Computer Science and Electrical
Engineering Department,
UMBC, where he has been since 1988. He received his Ph.D. from The
Ohio State University in 1988. Dr. Nicholas' research interests
include electronic document processing, information retrieval, and
software engineering. Dr. Nicholas has served five times as the
General Chair of the ACM Conference on Information and Knowledge
Management (CIKM), most recently in 2002. He also twice chaired the
Workshop on Digital Document Processing, PODP'96 and PODDP'98.
top
Matching Words and Pictures: Problems, Applications and Progress
Authors:
Latifur Khan, Ph.D.
Assistant Professor and Director of the UTD Database Laboratory (DBL@UTD)
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
Box 830688, EC 31
University of Texas at Dallas
Richardson, TX 75083-0688
Email: lkhan@utdallas.edu Voice: (972) 883-4137
URL: http://www.utdallas.edu/~lkhan Fax..: (972) 883-2349
Abstract:
The development of technology generates huge amounts of non-textual information, such as images.
An efficient image annotation and retrieval system is highly desired. Clustering algorithms make
it possible to represent visual features of images with finite symbols. Based on this, many
statistical models, which analyze correspondence between visual features and words and discover
hidden semantics, have been published. These models improve the annotation and retrieval of
large image databases.
In this tutorial, first, we will present current state of the art and its shortcomings. We will
present some classical models (e.g., translation model (TM), cross-media relevance model etc.).
Second, we will present weighted feature selection algorithm as a solution to the existing
problem. For a given cluster, we determine relevant features based on histogram analysis and
assign greater weight to relevant features as compared to less relevant features. Third, we will
exploit spatial correlation to disambiguate visual features, and spatial relationship will be
constructed by spatial association rule mining. Fourth, we will present the continuous
relevance model and multiple Bernoulli model for avoiding clustering. We will present
mechanisms to link visual tokens with keywords based on these models. Fifth, we will present
mechanisms to improve accuracy of classical model, TM by exploiting the WordNet knowledge-base.
Sixth, we will present a framework to model semantic visual concept in video/images by fusing
multiple evidence with the usage of an ontology. Seventh, we will show that weighted feature
selection is better than traditional ones (TM) for automatic image annotation and retrieval.
Finally, we will discuss open problems and future directions in the domain of image and video.
Aims/Learning objectives:
To provide automated annotation of images/pictures.
Scope:
Data Mining and Multimedia Information Management.
Keywords:
Multimedia Information Retrieval, Image, and Data Mining.
Target Audience:
Researchers and practitioners in the multimedia field with no previous or little knowledge about
data mining and machine learning.
Prerequisites:
Attendants are expected to have the following background:
- Knowledge of basic computer science principles and skills
- Familiarity with basic probability theory
Justification:
In this tutorial our primary focus will be: how low level image features can be linked with
keywords/high level concepts with the usage of clustering, classification, Expected maximization
(EM), and singular value decomposition (SVD). Note that all these topics are related to
data mining and machine learning areas. However, how the low level features will be extracted
from images will not be covered in this tutorial. We assume that we will use current state of
the art from computer vision research, (i.e., blob-world, normalized cut etc.).
Duration:
3 hours.
Enough material such as handouts and slides can be found at:
http://www.utdallas.edu/~lkhan/WWW2005Final.pdf
Biography:
Education
Ph.D. in Computer Science, University of Southern California (USC), Aug, 2000
Dissertation: Ontology-based Information Selection.
M.S. in Computer Science, University of Southern California, Dec, 1996.
B.S. in Computer Science and Engineering, Bangladesh University of Engineering and Technology,
Nov, 1993.
Work Experience
Assistant Professor: Computer Science Dept., University of Texas at Dallas, since Sept, 2000.
Graduate Research Assistant: Integrated Media Systems Center, USC, Sept, 1997--Aug, 2000, and
Information Sciences Institute (ISI), USC, Sept, 1995-Aug, 1997.
Lecturer: Computer Science & Engineering, Bangladesh University of Engineering and Technology,
Dec, 1993 -Aug, 1995.
Service to the Profession
-
Half Day Tutorial, "Matching Words and Pictures - Problems, Applications, and Progress", in 14th ACM International World Wide Web Conference, WWW2005, May 2005, Chiba, Japan, and in MIS 2005 International Workshop on Multimedia Information Systems, Sorrento, Italy.
-
Program Committee Member of 11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Chicago, August, 2005.
-
Program Committee Member of 17tth IEEE International Conference on Tools with Artificial Intelligence (ICTAI), November, 2005, Hong Kong.
-
Program Committee Member of the 13th International Conference on Cooperative Information Systems (CoopIS 2005), Agia, Cyprus, Oct, 2005.
-
Program Committee Member of International Conference on Database and Expert Systems Applications DEXA 2005, September, 2005, Copenhagen, Denmark.
-
Program Committee Member of IEEE 5th Symposium on Bioinformatics and Bioengineering, Minnesota, 2005.
-
Program Chair of ACM 6th International Workshop on Multimedia Data Mining (MDM/KDD2005), August 2005, Chicago, Illinois.
-
Program Co-Chair in First International Workshop on Geographic and Biological Data Management (GBDM04) in conjunction with COMPSAC 2004, September 28-30, 2004, Hong Kong.
-
Guest Editor, International Journal of Knowledge and Information Systems (KAIS) by Springer-Verlag, a special issue on “Multimedia Data Mining.”
-
Program Committee Member of IEEE International Conference on Multimedia & Expo (ICME) July, 2003, Maryland.
-
Invited Reviewer of IEEE Transaction on Knowledge and Data Engineering (TKDE), ACM Transaction on Information Systems (TOIS), International Journal of Cooperative Information Systems (IJCIS), International Journal of Knowledge and Information Systems (KAIS) by Springer-Verlag.
-
NSF Panelist at IIS division in 2003 and 2004.
Awards
-
A Distributed Component Repository for Rapid Synthesis of Adaptive Real-Time Systems, National
Science Foundation, Sept, 2001 -- Aug, 2004, NGS 0103709.
-
Developing Advanced Middleware for Convergence of IT and Telecommunications, Alcatel USA, Jan.
15, 2004, to Jan. 15, 2005.
Research Interests
Data Mining, Multimedia Information Management, Database Systems, Bio Informatics,
Intrusion Detection, Information Retrieval, Web-based Information Integration and management.
Selected Publications
-
L. Khan, M. Awad and B. Thuraisingham, "A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering," to appear The VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing.
-
F. Luo, L. Khan , F. Bastani, I-Ling Yen and J. Zhou, “A Dynamical Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering Gene Expression Profiles,” Bioinformatics Journal, 20 (16): 2605-2617 (2004), Oxford University Press, UK.
-
L. Khan, F. Luo, and K. Tang, “Hierarchical Clustering of Complex Data,” to appear in a special issue of the International Journal on Artificial Intelligence Tools, published by World Scientific.
-
L. Khan, D. McLeod, and E. Hovy, “Retrieval Effectiveness of Ontology-based Model for Information Selection,” The VLDB Journal: The International Journal on Very Large Databases, ACM/Springer-Verlag Publishing, Vol. 13(1): 71-85 (2004).
-
L. Khan and D. McLeod, “Audio Structuring and Personalized Retrieval Using Ontologies,” in Proc. of ACM/IEEE Advances in Digital Libraries, Library of Congress, Washington, DC, pp. 116-126, May 2000.
-
L. Khan, “Structuring and Querying Personalized Audio using Ontologies,” in Proc. of ACM Multimedia, Volume 2, Orlando, FL, pp. 209-210, Nov, 1999.
-
L. Khan, D. McLeod, E. Hovy, “A Framework for Effective Annotation of Information from Closed Captions Using Ontologies” to appear in Journal of Intelligent Information Systems, Kluwer Academic Publishers.
-
C. Shahabi, L. Khan, and D. McLeod, “A Probe-Based Technique to Optimize Join Queries in Distributed Internet Databases,” International Journal of Knowledge and Information Systems (KAIS) by Springer-Verlag, vol. 2 no. 3, pp. 372-385, 2000.
-
L. Wang, L. Liu, and L. Khan, “Automatic Image Annotation and Retrieval using Weighted Feature Selection” to appear in international Journal Multimedia Tools and Applications, Kulwer Publisher.
-
L. Khan, D. McLeod, and C. Shahabi, “An Adaptive Probe-Based Technique to Optimize Join Queries in Distributed Internet Databases,” International Journal of Database Management, 12(4): 3-14 (2001).
-
E. Bertino1, L. Khan, R. Sandhu and B. Thuraisingham, “Secure Knowledge Management,” to appear in IEEE Transactions on Systems Man and Cybernetics, Part A.
Advisor:
Dennis McLeod, Computer Science, University of Southern California.
Advisee:
Ph.D.: M. Awad, M. Richard, L. Wang, M. Alam (co-advised), Yuhan Chen (co-advised),
Ryan Layfiield (co-advised), Li Liu (co-advised), N. Tsybulnik (co-advised).
Graduated:
F. Luo (Ph.D. –summer 2004, initial employment at OAK Ridge National Laboratory),
A. Bashir (M.S. thesis), Manish Gupta (M.S. thesis), M. Mirza (M.S. thesis), C. Breen (M.S.),
R. Bhairampally (M.S.), and Y. Rao (M.S.).
top
Support Vector Machines for Information Management
Authors:
Dr. James G. Shanahan
Chief Scientist
Turn Inc.
1400 Fashion Island Blvd. Suite 510
San Mateo, CA 94404, U.S.A.
Tel: 650-522-4308
Abstract:
Support vector machines (SVM) are a general purpose suite of machine learning algorithms for
classification and regression that were introduced by Vapnik and his colleagues in 1992.
Generic support vector machines (SVMs) provide excellent performance on a variety of learning
problems including hand-written character recognition, face detection and, most recently, text
categorization.
Practical applications of SVMs to text can be found in areas such as knowledge management,
process improvement, CRM (Customer Relationship Management), text mining, alerting, intelligence
and law enforcement, spam and porn filtering, and bioinformatics. Most fall under the generic
umbrellas of text classification and adaptive filtering.
Research on customizing SVMs for text processing (ranging from classification to clustering)
has exploded in the past five years. This has resulted in a plethora of new approaches and
many interesting and commercially viable applications. However, the concrete steps required
to reduce SVM theory to practice are often not obvious or clearly explained in the research
literature.
Contents:
This tutorial provides a self-contained and systematic exposition of the following key concepts
in SVMs:
-
Support Vector Machines
-
Linear Separators
-
Primal SVMs
-
Dualspace SVMs
-
Hard and Soft SVMs
-
Kernels
-
Linear, Polynomial, RBF
-
String Kernels
-
Latent Semantic Kernels
-
SVM Learning Algorithms
-
Perceptron
-
Kernel Adatron
-
SMO learning algorithm (and improvements)
-
Quadratic Programming
This tutorial will show how generic SVMs can be customized for text processing tasks such as
classification and filtering. Topics covered here will include:
-
Uneven Margin Based Learning;
-
Thresholding SVMs;
-
Transductive SVMs; and
-
Shrinkage.
SVM-based solutions to actual text-processing problems will be described and compared to other
more traditional approaches.
Discussion of techniques will be supported by live demonstrations.
After attending this tutorial, you should:
-
Understand the fundamentals and the important ideas behind SVMs and kernels, with the help
of illustrative examples in the domain of text classification and filtering.
-
Have an in-depth understanding of how to implement an SVM learning algorithm or use a
publicly available SVM package for the tasks of text classification and filtering.
-
Have technical insight into technologies that you may see in commercial applications such as
knowledge management, process improvement, CRM (Customer Relationship Management), text mining,
alerting, intelligence and law enforcement, spam and porn filtering, and bioinformatics.
-
Understand the intellectual property surrounding SVM technology.
-
Be familiar with the exciting and promising areas of research in this domain.
TARGET AUDIENCE
This tutorial is especially targeted at people who are interested in implementing support
vector machines for text processing tasks; people who are responsible for understanding
implications of using such systems; people who want to understand and extend the core concepts;
and people who want to understand the intellectual property surrounding the core algorithms.
Biography
James G. Shanahan is chief scientist at Turn Inc, where his primary focus is on developing and
exploiting advanced techniques in information science to build a next generation online
advertising network. Prior to joining Turn, he was principal research scientist at Clairvoyance
Corporation where he led the Knowledge Discovery from Text Group. While at Clairvoyance
Corporation he was actively involved in developing text and process mining systems. Previously
he was research scientist at Xerox Research Center Europe (XRCE), Grenoble, where, as a member
of the Co-ordination Technologies Group, he developed Document Souls, a patented
document-centric approach to information access..
He has published five books in the area of machine learning and information processing
including a book on affect and opinion mining. In addition, he has authored over 50 research
publications and is an inventor on numerous patents, five of which have been granted. He has
given numerous tutorials at the Search Engines Conference, and at the ACM CIKM conference. He
has been a member of the organizing and program committees in numerous international conferences
(e.g., ACM SIGIR, ACM CIKM, IEEE FUZZ-IEEE) and workshops and is an active journal reviewer.
He was co-organizer of the AAAI Spring Symposium, EAAT, on Affect and Opinion Modeling
(Stanford, 2004) and a style in text workshop at SIGIR 2005.
top
Probabilistic and Causal Modeling with Graphical Models in Information and Knowledge Management
Ricardo Silva
CALD - Center for Automated Learning and Discovery
Carnegie Mellon University
Abstract:
Probability and causality are the foundations of many decision support systems. Building and
using probabilistic and causal models, however, can be a complex task. One way of approaching
this problem is through graphical models, where dependencies between random variables are
encoded in the language of graphs. Graphical representations are powerful tools for
introducing prior knowledge, estimating models from data, assessing plausibility of estimated
models, and to perform approximate inference where standard probabilistic models are
computationally intractable. This is possible because in many real-world domains there is a
natural structure that allows one to decompose a problem into independent substructures.
Goal of Tutorial:
In this tutorial, we give an introduction on how to use graphical models. The goal is to
provide an overview of the fundamental issues, from modeling background knowledge to
statistical and computational issues that arise when applying graphical models. The focus
will be more on the ideas underlying popular approaches rather than technical details. The
tutorial will include software demonstrations and examples in applications such as information
retrieval and clustering.
OUTLINE
-
INTRODUCTION TO GRAPHICAL MODELS
-
Concepts of probability and causality
-
The advantages of graphical models
-
Knowledge representation through graphical models
-
Directed acyclic graphs
-
Overview of other types of graphs
-
INFERENCE: HOW TO MAKE PREDICTIONS USING GRAPHICAL MODELS
-
The hardness of computing probabilities
-
Exact inference
-
Approximate inference
-
Causal predictions
-
LEARNING: ESTIMATING MODELS FROM DATA
-
Estimating parameters from data
-
Estimating structure from data
-
Learning causal models
-
Bayesian learning
-
Learning with hidden variables
-
4. EXAMPLES OF APPLICATIONS
-
Classification and density estimation
-
Information retrieval
-
Policy making
Audience:
Researchers and practitioners with some background in probability and statistics but little
background on graphical models that want a roadmap of the main topics on the area.
Bibliography:
Ricardo Silva concluded this year his PhD in Computational and Statistical Learning from the
Center for Automated Learning and Discovery, part of the School of Computer Science in Carnegie
Mellon University. He worked on discovering causal models with hidden variables from
observational data along with Richard Scheines, Clark Glymour and Peter Spirtes, during which
he won important scholarships such as a Microsoft Fellowship and a Siebel scholarship.
Currently, he is a post-doc at the Gatsby Computational Neuroscience Unit, University College
London.
top
High-Performance Indexing and Query Evaluation for Information Retrieval
Justin Zobel
School of Computer Science and Information Technology
RMIT University, Melbourne, Australia
Extended Abstract:
Text search is a key technology of modern computing. Search engines
that index the web provide a breadth and ease of access to information
that was inconceivable a decade ago. Text search has also grown in
importance at the other end of the size spectrum; the help services
built into operating systems, for example, rely on it. Search
algorithms have become an element of many different kinds of software.
Basic information retrieval techniques, developed and refined over more
than forty years, are well-known. Since 1990, a range of new query
evaluation algorithms and index representations have been appeared,
including those developed by the presenter of this tutorial, that allow
information retrieval queries to be efficiently resolved on document
collections containing terabytes of text. The challenges presented by
text search have led to development of a wide range of algorithms and
data structures. These include compact representations for text
indexes, innovative index construction techniques, and efficient
algorithms for evaluation of text queries. Systems with indexes based
on these techniques can resolve queries with a small fraction of the
resources required by traditional approaches to text indexing, and they
allow the rapid response expected of web search engines.
The proposed tutorial concerns the practical problems of indexing,
querying, storing, and updating large text databases, including those
that are the result of web-based information harvesting. While some of
these developments are well known and have been consolidated in
textbooks, many specific innovations are not widely known, and the
textbook descriptions are incomplete. This tutorial introduces the key
techniques in the area, describing both a core implementation and how
that core can be enhanced through a range of extensions and
innovations. The main elements of these innovations include fast index
construction techniques, novel index representations, and efficient
query evaluation strategies.
Objectives:
Attendees at the tutorial will learn how search engines are implemented
and should be able to undertake development of simple algorithms and
data structures for text search. The tutorial consolidates recent
innovations in search engine implementation, and provides attendees
with a basis for further research in the area.
Many of the techniques reviewed are used in the public-domain search
engine Zettair, developed at RMIT under the direction of the
presenter. Attendees can use the tutorial to learn about Zettair
internals and thus how to adapt it for use their own research.
Intended Audience:
The tutorial is designed to cater for the needs of several audiences:
-
Practitioners who wish to become familiar with recent advances
in the area of information retrieval implementation; or who wish to
to learn about novel techniques for efficient resource usage in
information retrieval systems, including techniques for index
compression, index construction, and high-performance query
evaluation;
-
Established researchers who wish to understand the various
implementation issues that must be addressed in practical
information retrieval systems; and
-
Students who wish to establish a basis for research in the broad
area of information retrieval system implementation.
Prerequisites:
None. Basic concepts of search and retrieval are introduced in the tutorial.
Biographical Information:
Justin Zobel completed his Ph.D. at the University of Melbourne, then
joined the academic staff of the School of Computer Science and
Information Technology at RMIT University. He is now a Professor and
leader of the RMIT Search Engine research group, which recently
released the public-domain Zettair search engine. Zobel has published
widely in the areas of information retrieval, database systems, text
databases, data structures, genomics, and compression, is the author of
the text "Writing for Computer Science", and is a coauthor of the texts
"Indexing Techniques for Advanced Database Systems" and "Document
Computing: Technologies for Managing Electronic Document Collections".
He is active in the information retrieval community and has served as
Chair and Program Committee member for a range of conferences,
including ACM SIGIR. He is currently Treasurer of SIGIR, an
editor-in-chief of the Information Retrieval journal, and an associate
editor of ACM Transactions on Information Systems and of Information
Processing & Management.
The tutorial is co-written by Alistair Moffat, a Professor in the
Department of Computer Science at the University of Melbourne. Moffat
has published more than 120 refereed papers in the areas of sorting and
searching algorithms; text, image, and index compression; and the
implementation of information retrieval systems. He is a coauthor of
the books "Managing Gigabytes: Compressing and Indexing Documents and
Images" and "Compression and Coding Algorithms".
Both Moffat and Zobel have many years experience with implementation of
retrieval systems. In the early 1990s, they developed MG, a research
prototype text retrieval system that was widely used in contexts such
as the international TREC information retrieval project. Both Moffat
and Zobel have ongoing research projects in the areas examined in this
tutorial.
top
|