Latest Newsletter
2005-08-12
XPath & XQuery

Newsletter Discussion


Abstract

XIRQL (An XML Query Language Based on Information Retrieval Concepts) incorporates concepts such as passage retrieval, precision search, precision (combination) plain text search, weighting, relevance-oriented search, data types and vague predicates, and structural relativism. The HyREX (Hypermedia Retrieval Engine for XML) server accepts XIRQL queries and returns pointers to the retrieved elements.

XIRQL

An XML Query Language Based on Information Retrieval Concepts

XML is increasingly acknowledged as a standard document format. Because XML separates logical markup from layout, it offers a wide range of opportunities for information retrieval, as listed below:

Passage retrieval: The logical structure of XML facilitates retrieving the parts of a document that are relevant to a given query, thus overcoming the limitations of passage retrieval in IR.

Precision search: Based on the markup of specific elements, high-precision searches can be performed that look for content occurring in specific elements (e.g. distinguishing between the sender and the addressee of a letter).

Precision (combination) plain text search: The concept of mixed content allows high-precision searches to be combined with plain text search. An element contains mixed content if both plain text and other elements may occur in it. Thus, it is possible to mark up specific items occurring in a text. For example, in an arts encyclopaedia, names of artists, places they worked, and titles of pieces of art may be marked up. This makes it possible, for example, to search for Picasso’s paintings of toreadors while avoiding passages that mention Picasso’s frequent visits to bull fights.
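
As a minimal sketch of precision search, the following Python snippet uses the standard library module xml.etree.ElementTree, whose limited XPath support is enough to restrict a match to one element type. The letter collection, its element names and the helper letters_with are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document base: two letters with swapped sender and addressee.
LETTERS = """
<letters>
  <letter id="1"><sender>Smith</sender><addressee>Jones</addressee></letter>
  <letter id="2"><sender>Jones</sender><addressee>Smith</addressee></letter>
</letters>
"""

root = ET.fromstring(LETTERS)


def letters_with(root, role, name):
    """Return the ids of letters whose <role> child has exactly the text `name`."""
    return [
        letter.get("id")
        for letter in root.findall(f"./letter[{role}='{name}']")
    ]


sent_by_jones = letters_with(root, "sender", "Jones")
addressed_to_jones = letters_with(root, "addressee", "Jones")
print(sent_by_jones, addressed_to_jones)
```

A plain text search for "Jones" would match both letters; restricting the match to the sender or the addressee element separates them.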

These requirements are addressed by XQL, XQuery and XPath, which would be a good starting point for IR on XML documents. However, the following features should be added to them:

Weighting: IR research has shown that document term weighting as well as query term weighting are necessary tools for effective retrieval in textual documents. So comparisons in XQL referring to the text of elements should consider index term weights. Furthermore, query term weighting should also be possible, by introducing a weighted sum operator (e.g. 0.6 · "XML" + 0.4 · "retrieval"). These weights should be used for computing an overall retrieval status value for the elements retrieved, thus resulting in a ranked list of elements.
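
The weighted sum can be illustrated with a small Python sketch. It assumes that each element already carries index term weights between 0 and 1 and that the retrieval status value is a plain linear combination of query term weights and index term weights. The element ids, the weights and the functions rsv and rank are hypothetical, and the scoring is a simplification rather than the weighting model of XIRQL itself.

```python
# Hypothetical index term weights per element (e.g. derived from term frequencies).
ELEMENT_WEIGHTS = {
    "sec1": {"xml": 0.8, "retrieval": 0.2},
    "sec2": {"xml": 0.3, "retrieval": 0.9},
    "sec3": {"retrieval": 0.5},
}

# Query term weights, mirroring 0.6 · "XML" + 0.4 · "retrieval".
QUERY = {"xml": 0.6, "retrieval": 0.4}


def rsv(index_weights, query):
    """Retrieval status value: sum of query weight times index term weight."""
    return sum(q * index_weights.get(term, 0.0) for term, q in query.items())


def rank(elements, query):
    """Return (element id, RSV) pairs, highest RSV first."""
    scored = [(eid, rsv(weights, query)) for eid, weights in elements.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank(ELEMENT_WEIGHTS, QUERY)
for eid, value in ranking:
    print(eid, round(value, 2))
```

With these assumed weights, sec1 ranks ahead of sec2 because the query puts more weight on "XML" than on "retrieval".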

Relevance-oriented search: The query language should also support traditional IR queries, where only the requested content is specified, but not the type of elements to be retrieved. In this case, the IR system should be able to retrieve the most relevant elements.
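
A rough Python sketch of this idea scores every element of a document, whatever its type, and returns the best ones. The sample document, the toy relevance measure (the fraction of an element's tokens that are query terms) and the function names are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = (
    "<book><title>Cooking</title>"
    "<chapter><heading>XML retrieval</heading>"
    "<para>XML retrieval ranks elements.</para></chapter>"
    "<chapter><heading>Gardening</heading>"
    "<para>Plants need water.</para></chapter></book>"
)


def score(elem, terms):
    """Fraction of the element's tokens that are query terms (a toy measure)."""
    tokens = re.findall(r"\w+", " ".join(elem.itertext()).lower())
    if not tokens:
        return 0.0
    return sum(token in terms for token in tokens) / len(tokens)


def most_relevant(root, terms, k=3):
    """Score every element regardless of type; return the top k (tag, score) pairs."""
    scored = [(elem.tag, score(elem, terms)) for elem in root.iter()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


top = most_relevant(ET.fromstring(DOC), {"xml", "retrieval"})
print(top)
```

The query names only the content ("XML", "retrieval"); which element type comes back is decided by the scores, not by the query.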

Data types and vague predicates: The standard IR approach for weighting supports vague searches on plain text only. XML allows for a fine-grained markup of elements, and thus there should be the possibility to use special search predicates for different types of elements. For example, for an element containing person names, a similarity search for proper names should be offered; in technical documents, elements containing measurement values should be searchable by means of the comparison predicates > and < operating on floating point numbers. Thus, it should be possible to have elements of different data types, where each data type comes with a set of specific search predicates. In order to support the intrinsic vagueness of IR, most of these predicates should be vague as well (e.g. search for measurements that were taken at about 20 degrees).
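
A vague predicate can be thought of as a function that returns a degree of match instead of true or false. The Python sketch below does this for "about 20 degrees". The linear fall-off, the tolerance of 5 degrees, the sample measurements and the name about are all illustrative assumptions.

```python
def about(value, target, tolerance=5.0):
    """Degree in [0, 1] to which `value` is 'about' `target`.

    The linear fall-off and the default tolerance are illustrative assumptions.
    """
    return max(0.0, 1.0 - abs(value - target) / tolerance)


# Hypothetical measurement elements: (element id, temperature in degrees).
MEASUREMENTS = [("m1", 20.0), ("m2", 22.5), ("m3", 31.0), ("m4", 19.0)]

# Rank the measurements by how well they satisfy "about 20 degrees".
scored = sorted(
    ((mid, about(temp, 20.0)) for mid, temp in MEASUREMENTS),
    key=lambda pair: pair[1],
    reverse=True,
)
# Drop measurements that do not match at all.
matches = [(mid, degree) for mid, degree in scored if degree > 0.0]
print(matches)
```

Unlike the strict predicates > and <, the result is a ranked list in which closer measurements come first and clearly distant ones drop out.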

Structural relativism: XQL is closely tied to the XML syntax, but it is possible to use syntactically different XML variants to express the same meaning. For example, particular information could be encoded as an XML attribute or as an XML element. As another example, a user may wish to search for a value of a specific datatype in a document (e.g. a person name) without caring which element contains it. Thus, appropriate generalisations should be included in the query language.
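
The attribute versus element case can be sketched in Python with xml.etree.ElementTree. The helper values is hypothetical: it looks up a name without caring whether the document encodes it as an attribute or as a child element. The two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same information, encoded once as an attribute and once as an element.
AS_ATTRIBUTE = ET.fromstring('<book author="Smith"><title>XML</title></book>')
AS_ELEMENT = ET.fromstring(
    "<book><author>Smith</author><title>XML</title></book>"
)


def values(elem, name):
    """Collect values called `name`, whether stored as attribute or child element."""
    found = []
    if name in elem.attrib:
        found.append(elem.attrib[name])
    found.extend((child.text or "").strip() for child in elem.findall(name))
    return found


print(values(AS_ATTRIBUTE, "author"), values(AS_ELEMENT, "author"))
```

Both calls return the same value, so a query written against this generalised lookup does not depend on the syntactic variant used in a given document.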

XIRQL incorporates the above-mentioned concepts. Based on these concepts, a retrieval engine named HyREX (Hypermedia Retrieval Engine for XML) has been implemented. In order to set up a document base with HyREX, the XML Schema descriptions for the documents (along with the HyREX-specific application information) must first be specified. Given the document base schema, the system accepts XML documents, indexes them and creates its internal index structures. Currently, B*-trees and variants of inverted lists are used for this purpose. Following this step, the HyREX server accepts XIRQL queries and returns pointers to the elements retrieved. In order to use HyREX as a standalone retrieval system, a simple Web-based user interface (HyGate) has been developed that accepts query formulations either in XIRQL or based on application-specific forms, sends the query to the server and receives result lists as well as single result elements. For presenting the output in HyGate, the document base administrator has to specify appropriate XSLT stylesheets, both for the result survey page(s) and for the display of single result elements. HyREX is designed as an extensible IR architecture. For specific applications, new datatypes can be added to the system, possibly together with new index structures.
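
The step of indexing documents and returning pointers to elements can be pictured with a small inverted list in Python. This is not HyREX code and does not reflect its actual index structures. The functions build_index and lookup, the document id "d1" and the path notation used as an element pointer are assumptions made for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(doc_id, root):
    """Map each term to the set of (document id, element path) pointers."""
    index = defaultdict(set)

    def visit(elem, path):
        # Text directly inside this element: its own text plus its children's tails.
        direct_text = [elem.text or ""] + [child.tail or "" for child in elem]
        for term in re.findall(r"\w+", " ".join(direct_text).lower()):
            index[term].add((doc_id, path))
        seen = {}  # per-tag counters to build positional paths such as sec[1]
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    visit(root, f"/{root.tag}[1]")
    return index


def lookup(index, term):
    """Return the sorted element pointers for a single term."""
    return sorted(index.get(term.lower(), ()))


# Hypothetical sample document.
DOC = (
    "<article><title>XML retrieval</title>"
    "<sec><p>Ranking XML elements</p></sec></article>"
)
index = build_index("d1", ET.fromstring(DOC))
pointers = lookup(index, "XML")
print(pointers)
```

Each hit identifies an element rather than a whole document, which is what allows a result list to consist of individual elements.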

Some open issues remain to be considered when using XIRQL. At the system level, there is the question of appropriate access methods and query processing strategies. For the user interface, it is not clear in which form end users should formulate their queries. The presentation of results to the user also remains to be worked out, since some of the result elements may belong to the same document.

Reference: Norbert Fuhr, Kai Großjohann (2002)

XIRQL: An XML Query Language Based on Information Retrieval Concepts

   