Utilizing Structural Knowledge for Information Retrieval in XML Databases

V. Mihajlovic, Djoerd Hiemstra, H.E. Blok, Peter M.G. Apers

Research output: Book/Report › Report › Professional

Abstract

In this paper we address the problem of immediate translation of eXtensible Mark-up Language (XML) information retrieval (IR) queries to relational database expressions and stress the benefits of using an intermediate XML-specific algebra over relational algebra. We show how adding an XML-specific algebra at the logical level of a DBMS enables a level of abstraction from both the query languages for information retrieval in XML and the underlying physical storage and manipulation. We picked a region algebra as the basis for defining the structure-aware (SA) view on XML, in which we can distinguish among different XML entities, such as element nodes, text nodes, and words, and determine their containment relations. Region algebras are already well established in semi-structured document processing, as shown in the extensive overview of region algebra approaches in this paper. Furthermore, we propose a variant of region algebra that can support ranking operators in an elegant way while staying algebraic. Because relevance scores are computed for regions in our region algebra, we named it score region algebra (SRA). The benefits of introducing score region algebra are illustrated with a set of query examples. Besides abstracting from the query language used and the physical implementation, SRA enables a certain degree of abstraction from the retrieval model used and the opportunity to apply query optimization at the logical level of a database. Various retrieval models can be instantiated at the physical level based on the abstract specification of SRA operators. We also discuss numerous region algebra operator properties that provide firm ground for query rewriting and optimization at the SA level, which is an important premise for the existence of such a logical view on XML.
Original language: Undefined
Place of Publication: Enschede
Publisher: Databases (DB)
Number of pages: 42
Publication status: Published - May 2005

Publication series

Name: CTIT-TR-05
Publisher: University of Twente, Centre for Telematics and Information Technology (CTIT)
No.: 19

Keywords

  • EWI-5740
  • IR-53300
  • METIS-225861