Utilizing Structural Knowledge for Information Retrieval in XML Databases

V. Mihajlovic, Djoerd Hiemstra, H.E. Blok, Peter M.G. Apers

Research output: Book/Report › Report › Professional

Abstract

In this paper we address the problem of immediate translation of eXtensible Mark-up Language (XML) information retrieval (IR) queries to relational database expressions and stress the benefits of using an intermediate XML-specific algebra over relational algebra. We show how adding an XML-specific algebra at the logical level of a DBMS enables a level of abstraction from both the query languages for information retrieval in XML and the underlying physical storage and manipulation. We picked a region algebra as the basis for defining the structure-aware (SA) view on XML, in which we can distinguish among different XML entities, such as element nodes, text nodes, and words, and determine their containment relations. Region algebras are already well established in semi-structured document processing, as shown in the extensive overview of region algebra approaches in this paper. Furthermore, we propose a variant of region algebra that can support ranking operators in an elegant way while staying algebraic. Because relevance scores are computed for regions in our region algebra, we named it score region algebra (SRA). The benefits of introducing score region algebra are illustrated with a set of query examples. Besides abstracting from the query language used and the physical implementation, SRA enables a certain degree of abstraction from the retrieval model used and the opportunity to apply query optimization at the logical level of a database. Various retrieval models can be instantiated at the physical level based on the abstract specification of SRA operators. We also discuss numerous region algebra operator properties that provide a firm ground for query rewriting and optimization at the SA level, which is an important premise for the existence of such a logical view on XML.
Original language: English
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 42
Publication status: Published - May 2005

Publication series

Name: CTIT technical report series
Publisher: Centre for Telematics and Information Technology, University of Twente
No.: CTIT-TR-05-19
ISSN (Print): 1381-3625
