Annotated bibliography

Until 2004 or so I maintained annotations on the publications. Primary links in PDF or HTML. ps also available. Talk slides in html.

1.            A. Balmin, Y. Papakonstantinou, V. Vianu Incremental Validation of XML Documents” In ACM TODS. It extends the ICDT 2003 paper with the "local validation" technique, which allows the O(m) validation of m updates against DTDs or XML Schemas whose regular expressions are ``local". It turns out that locality is a very common property: for example, in the 64 DTDs found at OASIS, 98% of regular expressions are local. The underlying data structure is extremely simple (just counters). For non-local DTDs and XML Schemas one should use the ICDT 2003 technique.

2.            A. Balmin, V. Hristidis, Y. Papakonstantinou Authority-Based Keyword Queries in Databases using ObjectRank” In VLDB 2004. Brings authority-based search to databases. Check out also the demo on DBLP data.

3.            A. Deutsch, Y. Papakonstantinou, Y. Xu The NEXT Logical Framework for XQuery” In VLDB 2004. We introduce Nested XML Tableaux that enable the application of various logical optimizations to XQuery. A key component of NEXT is a group-by construct that allows us to consolidate the multiple appearances of navigation in an XQuery FLWR to a single spot. As a proof of concept we present the minimization of XQueries using NEXT.

4.            V. Hristidis, L. Gravano, Y. Papakonstantinou Efficient IR-Style Keyword Search over Relational Databases” In VLDB 2003. Applications in which plain text coexists with structured data are pervasive. Commercial relational database systems provide information retrieval (IR) ability (mainly keyword search) for text attributes but this search functionality requires that queries specify the exact column against which the list of keywords will be matched. However, good answers to a keyword query may need to be assembled by joining tuples from multiple relations. this observation has motivated recent research of free-form keyword search over relational databases (including UCSD's DISCOVER and XKeyword) but IR ranking strategies have not been used - rather, results are ranked by the proximity of the keywords only. This work combines IR-style document relevance ranking strategies with DISCOVER's ability to find related tuples that contain the requested set of keywords. A key performance point is the ability to find the top-k results efficiently. A number of algorithms are presented and evaluated.

5.            V. Hristidis, Y. Papakonstantinou Algorithms and Applications for Answering Ranked Queries Using Ranked Views” In VLDB Journal. Extended Version of SIGMOD 2001: We extend the class of functions for which the algorithm of SIGMOD 2001 works, provide multiple performance improvements (eg., we only materialize top-M tuples of each view), and provide speculative algorithms. We describe the implementation and we also provide an application of the core algorithm in merging ranked data from multiple sources.

6.            M. Petropoulos, A. Deutsch, Y. Papakonstantinou Query Set Specification Language (QSSL)” In WebDB 2003 ((Talk in html). WSDL-based web services are becoming a popular way of making information sources available on the web. We argue that the function signature paradigm that is used today by web services cannot capture the query capabilities provided by structurally rich and functionally powerful information sources, such as relational databases. We propose the Query Set Specification Language (QSSL) that allows the concise description of sets of parameterized XPath queries. A QSS is embedded in a WSDL specification to form a specialized type of web services, called Data Services. Data Services connect the calls that the source accepts with the underlying schema.

7.            Y. Papakonstantinou, V. Vianu Incremental Validation of XML Documents” In ICDT 2003. (Talk in ppt) We investigate the incremental schema validation of XML data with respect to DTDs and XML Schemas under updates consisting of element tag renamings, insertions and deletions. For DTDs we provide an O(m log n) incremental validation algorithm using an auxiliary structure of size O(n), where n is the size of the document and m is the number of updates. Note that a brute force solution requires O(n) steps: following an update, we run the updated lists of elements through automatons that correspond to regular expressions in the DTD. For XML Schemas, the problem is harder, since an update to a single node may have global repercussions for the typing of the tree. We provide an O(m log^2 n) algorithm that is based on the use of tree automata and an O(n) auxiliary data structure.

8.             V. Hristidis, Y. Papakonstantinou, A. Balmin Keyword Proximity Search on XML Graphs” In ICDE 2003. XKeyword inherits the query processing technology of DISCOVER, since it is an XML database built on top of a relational database. XML data are shredded into relational and XKeyword optimizes its performance by materializing a set of indexed path relations, where each one describes a particular pattern of paths in the XML graph. Furthermore, XKeyword addresses the information overload exhibited by prior keyword proximity search systems (including DISCOVER), where the number of connections is huge and many connections are inferred from others. XKeyword provides an interactive result graph that allows the user to navigate, “drill-down”, and “roll-up” in the results so that he can focus on the most interesting connections. A demo of XKeyword on the DBLP data is available at http://kebab.ucsd.edu:81/xkeyword/.

9.            B. Ludascher, P. Mukhopadhay, Y. Papakonstantinou A Transducer-Based XML Query Processor” (ps) In VLDB 2002 (Talk). The XML Stream Machine (XSM) system processes XQueries using a paradigm that is tuned to the efficient processing of sequentially accessed XML data (streams). The system compiles a given XQuery into an XSM, which is an XML stream transducer, i.e., an abstract device that inputs one or more XML streams, outputs one or more output streams and potentially uses internal buffers. The paper presents a systematic construction of the XSM: First a network of XSMs is built (one for each XQuery operator). Then the network is reduced to a single XSM by repeated application of an XSM composition operator. The resulting XSM is optimized using schema knowledge. In particular, each state of the transducer is annotated with regular languages that constrain and describe the contents of the internal buffers. Based on the knowledge of buffer content the optimizer removes redundant tests, actions, and buffers. Finally, the optimized XSM is compiled into a C program, which performs much better than XSLT processors with which it is compared.

10.        V. Hristidis, Y. Papakonstantinou DISCOVER: Keyword Search in Relational Databases” In VLDB 2002. DISCOVER allows the user to issue keyword queries on a relational database. It returns qualified joining networks of tuples, that is, sets of tuples that are associated because they join on their primary and foreign keys and collectively contain all the keywords of the query. DISCOVER proceeds in two steps. First it finds all candidate networks of relations, i.e., join expressions that generate the joining networks of tuples. DISCOVER uses the structure of the schema to guarantee that no redundant networks are produced. Then it builds plans for the efficient evaluation of candidate networks, exploiting the common subexpression structure that typically appears. Choosing the most efficient sequence of computing subexpressions is NP-complete. We provide and evaluate a near-optimal greedy optimizationalgorithm.

11.        Y. Papakonstantinou, V. VassalosArchitecture and Implementation of an XQuery-based Information Integration Platform” (ps) In Data Engineering Bulletin, March 2002. Brief description of the Enosys integration platform, with emphasis on the query processing aspects of it.

12.        Y. Papakonstantinou, M. Petropoulos, V. Vassalos QURSED: Querying and Reporting Semistructured Data” In ACM SIGMOD 2002. (Talk in ppt) We present a system for the development of Web-based query forms and reports (QFRs) that query and report semistructured XML data, i.e., data that are characterized by nesting, irregularities and structural variance. The query aspects of a QFR are captured by its query set specification, which formally encodes multiple parameterized, possibly interdependent condition fragments and can describe large numbers of queries. The run-time component of the system produces XQueries by synthesizing fragments from the query set specification that have been activated during the interaction of the end-user with the QFR. The design-time component semiautomates the construction of the QFR and its coupling with the visual aspects of the Web-based form and report.

13.        P. Mukhopadhay, Y. Papakonstantinou Mixing Querying and Navigation in MIX” (ps) In Data Engineering (ICDE), 2002. Interleaving of browsing and querying is very common in information discovery. We describe a DOM-like API that supports interleaved querying and browsing of virtual XML views provided by the MIX mediator. The client issues queries, navigates into the result nodes, and uses any result node as the root of a new query. Efficient mixing of querying and browsing requires queries to be evaluated from within the context that prior queries and navigations have created. We describe the decontextualization process that turns a query that is meaningful within a context to an efficient “stand-alone” query. Note that query evaluation is navigation-driven, a’ la the EDBT 2000 paper, so that only data that are actually needed in the course of navigation are generated.

14.        M. Petropoulos, V. Vassalos, Y. Papakonstantinou “Building XML Query Forms and Reports with XQForms" In Computer Networks Journal Special Issue on XML, Elsevier Science, vol 39/5, pp. 541-558, 2002. (Journal version of WWW10.) Describes how the Enosys Markets XQForms system can be used to build query forms and reports. It is particularly interesting to notice the decoupling of query aspects from visual aspects in the architecture and methodology of building XQForms.

15.        Y. Papakonstantinou, V. VassalosThe Enosys Markets Data Integration Platform: Lessons from the Trenches” (pdf) In industrial track of ACM Conference on Information and Knowledge Management (CIKM), 2001. Brief description of the architecture and components of the Enosys Markets integration platform, along with a discussion of key challenges.

16.        V. Hristidis, N. Koudas, Y. Papakonstantinou “PREFER: A System for the Efficient Execution of Multi-Parametric Ranked Queries ”. (ps) In ACM SIGMOD 2001. Preference queries use a weight function over a relation’s attributes to derive a score for each tuple.  Preference queries are needed in operations’ research and many real life apps but database systems cannot efficiently produce the top results of a preference query because they need to evaluate the weight function over all tuples of the relation.  PREFER can pipeline and produce the top results of preference queries efficiently by using materialized views that have been preprocessed and stored. Intuitively, the key requirement is to find a materialized view whose weight function is similar to the query’s weight function.  We show that excellent performance can be delivered by materializing a reasonable number of views and we provide an algorithm that selects a number of views to precompute and materialize given specific space constraints.

  1. M. Petropoulos, V. Vassalos, Y. Papakonstantinou “XML Query Forms (XQForms): Declarative Specification of XML Query Interfaces” (pdf). In WWW 10, 2001. The Enosys Markets XQForms system inputs an XML schema and a declarative specification of query forms and result pages and produces a Web front-end that interacts with the user, emits XMAS queries to an underlying XML data server and formats the results in real time. The user of XQForms can also set up “decision trees” (i.e., answers to specific questions may invoke additional questions), effective result summarization and drill-down of the query result. XQForms can easily adapt to any XML query language supporting path expressions.

17.        A. Balmin, Y. Papakonstantinou, T. Papadimitriou "Hypothetical Queries in an OLAP Environment". (ps) In Very Large DataBases (VLDB) 2000. We present the Sesame system for the efficient modeling and execution of what-if scenarios. We model what-if scenarios as a list of hypothetical modifications on the warehouse view and fact data. The semantics extend conventional view update semantics in order to accommodate the special requirements of OLAP. Execution is based on substitution and rewriting. Substitution enables lazy evaluation of the hypothetical updates, hence saving the system from materializing large warehouses/cubes that correspond to the hypothetical world. The rewriting module uses axiomatic knowledge of the arithmetic, relational, and financial to efficiently rewrite the query. Special rewriting challenges arise from the size and nature of the queries derived by the substitution module and because of the wide variety of operators in the views and query. We present and experimentally evaluate a rewriter that solves the challenges by using two new rewriting techniques, going by the names "minterms" and "packed forests".

18.        K. Munroe, Y. Papakonstantinou "BBQ: A Visual Interface for Integrated Browsing and Querying of XML". (ps) In Visual Database Systems (VDB) 2000. The Blended Browsing and Querying (BBQ) GUI allows one to view the schema (DTD) of an XML source and graphically formulate queries. The tool allows one to express selections, joins, groupings, and creation of new structures simply by clicking and dragging-and-dropping objects. The blending part of it comes in when one uses the result of a query as the source upon which the next query is posed.

19.        K. Munroe, B. Ludascher, Y. Papakonstantinou "Blended Browsing and Querying of XML in a Lazy Mediator System" In Exhibitions section of Extending DataBase Technology (EDBT) 2000.

20.        Y. Papakonstantinou, V. Vianu "DTD Inference for Views of XML Data". (ps) In Principles Of Database Systems (PODS) 2000. The ICDE 99 paper had proposed the soundness and tightness concepts for measuring the descriptive ability of DTDs. This paper shows some strong inherent limitations on the descriptive power of DTDs. It proposes two DTD extensions that enhance their descriptive power: The first extension is the so-called specialized DTDs, which are DTDs with a simple good-old subtyping mechanism. The second extends the regular expressions of DTDs to context-free languages. These two extensions allow us to deliver tight descriptions for the views produced by selection queries with regular path expressions. The paper also describes practical cases where the second extension is not necessary. PS1 (2000): Interestingly, the subtyping mechanisms of specialized DTDs can be expressed by the archetype feature of the XML Schema standard. Next time you'll get involved in a discussion over DTDs VS XML Schemas do not forget to impress your friends and enemies by stating that XML Schemas allow us to produce tight descriptions of XML views. PS2 (2001): Suddenly the XML Schema committee became too sensitive on algorithmic complexity issues and placed conditions on the subtyping constructs in order to guarantee linear time parsing. Tightness evades, once again.

21.        B. Ludascher, Y. Papakonstantinou, P. Velikhov "Navigation-Driven Evaluation of Virtual Mediated Views". (ps) In Extending DataBase Technology (EDBT) 2000. We present the lazy, navigation-driven evaluation of queries/views by the MIX mediator. The mediator exports a virtual view XML document into which the client navigates. Client navigation commands are then translated into source/wrapper sequences of navigation commands. The proposed demand-driven approach is inevitable in order to handle up-to-date mediated views of huge Web sources. The non-materialization of the view is transparent to the client application since clients navigate the query answer using a subset of the standard DOM API interface for XML documents. We describe query evaluation in such an environment.

22.        B. Ludascher, Y. Papakonstantinou, P. Velikhov "A Framework for Navigation Driven Lazy Mediators" (ps). In the WebDB99 workshop, held in conjunction with ACM's SIGMOD99 This paper presents the framework of the super-lazy MIX mediators. The mediator exports a virtual view XML document into which the client navigates. Client navigation commands are then translated into source/wrapper sequences of navigation commands.

23.        Y. Papakonstantinou, V. Vassalos "Query Rewriting for Semistructured Data" (ps) In ACM SIGMOD 99. We present an algorithm that, given a semistructured query q and a set of semistructured views V, finds rewriting queries, i.e., queries that access the views and produce the same result with q. The algorithm is based on generalizing containment mappings, the chase, and unification. We prove that all rewriting queries are found and we characterize the complexity. Applications of the algorithm include (i) reformulating queries in order to use materialized views or a cache and (ii) rewriting queries in order to adapt them to the limited capabilities of semistructured sources.

24.        C. Baru, A. Gupta, B. Ludaescher, R. Marciano, Y. Papakonstantinou, P. Velikhov, "XML-Based Information Mediation with MIX" (ps) In Exhibitions Program of ACM SIGMOD 99.

25.        Y. Papakonstantinou, P. Velikhov "Enhancing Semistructured Data Mediators with Document Type Definitions" (ps) (Talk) In Data Engineering (ICDE) 99. Mediators, such as MIX, have to provide DTDs of the views that they compute in order to facilitate query formulation, query processing, and so on. However, a view has more than one DTDs. We propose two quality metrics of view DTDs - roughly similar to "precision" and "recall". The first one, soundness, requires that every possible view document is described by the view DTD. The second one, tightness, requires that the view DTD describes the minimum number of documents that may not appear in the view (ideally, the number is zero). Finally we describe an algorithm for deriving a sound and tight view DTD given a view that has a single variable in the head (SELECT clause) and no recursion (or regular expressions) in the body (WHERE clause).

26.        B. Ludaescher, Y. Papakonstantinou, P. Velikhov, V. Vianu "View Definition and DTD Inference for XML" Post-ICDT Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats, 99.

27.        C. Baru, B. Ludaescher, Y. Papakonstantinou, P. Velikhov, V. Vianu "Features and Requirements for an XML View Definition Language: Lessons from XML Information Mediation" Position paper in W3C's QueryLanguage Workshop.

28.        L. Gravano, Y. Papakonstantinou "Mediating and Metasearching on the Internet" (ps)In Data Engineering Bulletin 21(2), 1998. Discusses the similarities and differences between metasearchers, asknown in the IR community, and mediators, as known in the database community.

29.        V. Vassalos, Y. Papakonstantinou "Using Knowledge of Redundancy for Query Optimization in Mediators" (ps) In AAAI98 Workshop on AI and Data Integration.

30.        Chen Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman "Capability-Based Mediation in TSIMMIS" ( ps) In Exhibits Program of ACM SIGMOD 98.

31.        R. Yerneni, Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina "Fusion Queries over Internet Databases" (ps) In Extending DataBase Technology (EDBT) 98. Mediators create virtual entities from information that is distributed across multiple sources. Fusion queries select the entities that satisfy some user-specified conditions. The fusion query optimization problem becomes particularly challenging when the conditions have to be tested on multiple sources. We show that under common assumptions a limited class of plans, the semijoin adaptive plans, contain the optimal fusion plan. An experimental evaluation delivers further results on choosingthe optimal fusion plan.

32.        V. Vassalos, Y. Papakonstantinou "Expressive Capabilities Description Languages and Query Rewriting Algorithms" (ps) In Journal of Logic Programming. Extends the formal arguments and improves the algorithms of the VLDB97 paper. In addition, it shows (1) how the query capabilities of a mediator can be automatically computed from the query capabilities of the sources and (2) a limited (yet practically very common) class of descriptions for which the algoriths are much more efficient.

33.        V. Vassalos, Y. Papakonstantinou "Describing and Using Query Capabilities of Heterogeneous Sources". (ps) In VeryLarge DataBases (VLDB) 97. First the paper compares the capabilities expressive power of RQDL and p-Datalog. The latter is a version of Datalog where the expansions of the Datalog program that have only EDBs stand for the set of supported queries. We prove that p-Datalog is not sufficient for expressing the capabilities of "powerful" sources such as SQL DBs; on the other hand RQDL is powerful enough to describe such sources. Then we describe a reduction of RQDL into Datalog with function symbols and based on this reduction some efficient CBR algorithms are derived.

34.        Y. Papakonstantinou, A. Gupta, L. Haas "Capabilities-Based Query Rewriting in Mediator Systems (Extended Version)" (ps) In Distributed And Parallel Databases, 6(1), Jan 98. Extends PDIS 96. It also extends the Capabilities Based Rewriting algorithm with the appropriate hooks for cost optimization and describes what plan pruning may be performed without losing the CBR's completeness. Finally it compares the CBR architecture with a real CBR-enhanced optimizer that was developed for Garlic.

35.        Y.Papakonstantinou, A. Gupta, L. Haas. "Capabilities-Based Query Rewriting in Mediator Systems". (ps) In Parallel and Distributed Information Systems (PDIS) 96. Selected in the "Best of PDIS". A mediator has to make sure that the queries that it sends to the sources are supported by them. Furthermore, the mediator has to leverage as much as possible on the source query capabilities. The described Capabilities-Based Rewriter (CBR) module of a mediator takes as input (i) a description of the queries supported by the underlying sources and (ii) a client Select-Project-Join query. From these it determines SPJ sub-queries to be sent to the sources (the sub-queries are commensurate to the sources' abilities) and computes a plan for combining the sub-queries' results using selection, projections, and joins. The description of the supported queries is done in RQDL, a description language that can be schema independent and can describe infinitely many "query patterns".

36.        H. Garcia-Molina , Y. Papakonstantinou , D. Quass , A. Rajaraman , Y. Sagiv , J. Ullman , V. Vassalos , J. Widom "The TSIMMIS approach to mediation: Data models and Languages". (ps) In Journal of Intelligent Information Systems". The most complete overview of TSIMMIS.

37.        Y.Papakonstantinou, S. Abiteboul, H. Garcia-Molina. "Object Fusion in Mediator Systems". (ps) In Very Large DataBases (VLDB) 96. Mediators fuse information from multiple heterogeneous information sources. In the process they remove redundancies and resolve inconsistencies.The problem becomes harder when the sources are unstructured/semistructured and we do not have complete knowledge of their contents and structure. We show the use of skolem constants in the view definition language of the TSIMMIS mediator.

38.        Y. Papakonstantinou "Query Processing in Heterogeneous Information Sources". (ps) Stanford University Thesis, 1997. Paper saving condensed version. ( ps) Is the result of 4 years at Stanford this document that has been read by3 people? Are such facts demoralizing for graduate students? They should be. Now the ball is in your field: By clicking one of the links you give hope to current and future graduate students who wonder why they spend their 20's to write a document that only three people will read.

39.        S. Adali, S. Candan, Y. Papakonstantinou, V.S. Subrahmanyan. "Query Caching and Optimization in Mediator Systems". (ps) In ACM SIGMOD 96.

40.        Y. Papakonstantinou, H. Garcia-Molina, J. Ullman. "MedMaker: A Mediation System Based on Declarative Specifications". (ps) In Data Engineering (ICDE) 96.

41.        Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman. "A Query Translation Scheme for Rapid Implementation of Wrappers". (ps) In Deductive and Object-Oriented Databases (DOOD) 95.

42.        N. Roussopoulos, C.M. Chen, S. Kelley, A. Delis, Y. Papakonstantinou. "The ADMS Project: Views R Us". (ps ) In IEEE Data Engineering Bulletin, June 95.

43.        Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. "Object Exchange Across Heterogeneous Information Sources". (ps) In Proceedings of Data Engineering (ICDE), Taipei, Taiwan, March 1995. This paper describes the (strikingly similar to XML) semistructured Object Exchange Model (OEM) and describes its suitability in mediating over heterogeneous and Internet sources. The paper also provides the first description of wrappers and mediators. Two years later XML was formulated and by 1998 you'd find hundreds of quotations by industry people on the suitability of XML for mediation and information exchange. We are happy to be there 4 years ahead!

44.        J. Hammer, H. Garcia-Molina, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. "Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System". ( ps) In Exhibits Program of ACM SIGMOD, San Jose, California, June 1995.

45.        H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, J. Widom. "The TSIMMIS approach to Mediation: Data Models and Languages". ( ps) In NGITS 95

46.        H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. "Integrating and Accessing Heterogeneous Information Sources in TSIMMIS". ( ps) In Proceedings of the AAAI Symposium on Information Gathering, Stanford, California, March 1995.

47.        S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. "The TSIMMIS Project: Integration of Heterogeneous Information Sources". (ps ) In Proceedings of the 100th IPSJ Anniversary Meeting, Tokyo, Japan, October 1994.

48.        G. Papakonstantinou (Dept. of Electr. Eng., Nat. Tech.Univ. of Athens, Greece), T. Panayiotopoulos, N. Georgoulis, and Y. Papakonstantinou."A parallel full theorem prover with uncertainties". In Parallel andDistributed Computing in Engineering Systems. Proceedings of the IMACS/IFACInternational Symposium., pages 117-21, Corfu, Greece, 23-28 June1991.

  1. A. Stafylopatis (Dept. of Electr. Eng., Nat. Tech.Univ. of Athens, Greece), Y. Papakonstantinou, and S. Kaxiras. "PSM: software tool for simulating, prototyping, and monitoring of multiprocessor systems". Information and Software Technology, May 1992, vol.34, no.5, pages 313-25