We present PIX, a system that enables flexible and efficient phrase matching in XML documents. Since XML allows structured and unstructured information to be interleaved, XML documents often contain "mixed content". Unlike phrase matching on "flat text", phrase matching on mixed content raises new challenges. In particular, the phrase to match may span document structure (for example, element boundaries). The key features of PIX are (i) flexibility: users specify which markup and element content to ignore when matching a phrase; (ii) support for both exact and approximate matching, with ranked query answers; and (iii) efficiency, through inverted indices and novel algorithms. In addition, approximate querying in PIX enables top-K phrase queries in which the ignored tags and ignored element content are relaxed. PIX's functionality is fully integrated into XQuery and naturally combines XPath navigation with phrase matching. PIX is implemented as an extension to GALAX, a full-fledged XQuery engine.
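To make the idea of ignoring markup and element content concrete, here is a minimal Python sketch of exact phrase matching over mixed content. It is an illustration only and not PIX's algorithm: it scans the document linearly rather than using inverted indices, it is not integrated with XQuery, and it does no approximate matching or ranking. The names `tokens`, `contains_phrase`, and `ignore_content`, as well as the sample document and its `footnote` element, are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET


def tokens(elem, ignore_content):
    """Yield lowercase word tokens of `elem` in document order.

    Tags never break a phrase (all markup is ignored). Text inside an
    element whose tag is listed in `ignore_content` is skipped entirely.
    """
    if elem.tag in ignore_content:
        return
    yield from re.findall(r"\w+", (elem.text or "").lower())
    for child in elem:
        yield from tokens(child, ignore_content)
        # Tail text follows the child's end tag and belongs to the parent,
        # so it is kept even when the child's own content is ignored.
        yield from re.findall(r"\w+", (child.tail or "").lower())


def contains_phrase(xml_text, phrase, ignore_content=frozenset()):
    """Return True if `phrase` occurs as consecutive tokens in the document."""
    words = list(tokens(ET.fromstring(xml_text), ignore_content))
    target = re.findall(r"\w+", phrase.lower())
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


# Hypothetical mixed-content document: the phrase is interrupted by a <b>
# tag and by a <footnote> element carrying its own text.
DOC = "<p>The <b>quick</b> brown<footnote>see appendix</footnote> fox jumps</p>"

if __name__ == "__main__":
    print(contains_phrase(DOC, "quick brown fox", {"footnote"}))
    print(contains_phrase(DOC, "quick brown fox"))
```

With `footnote` in `ignore_content`, the footnote text is dropped and the phrase matches across the `<b>` tag; without it, the footnote words sit between "brown" and "fox" and the exact match fails. A relaxed, ranked variant in the spirit of the top-K queries described above would score such near misses instead of rejecting them, which this sketch does not attempt.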
Joint work with Sihem Amer-Yahia, Mary Fernandez, and Divesh Srivastava (AT&T Labs Research).
Online demo available after June 10, 2003.