DataDirect XQuery™ Performance: Generating SQL

Implementing XQuery efficiently for relational databases is not trivial, because XQuery and SQL support different operations on quite different data models. At DataDirect, we emphasized performance and scalability in the design of our XQuery engine from the beginning, with a strong focus on relational data. This paper presents some of the techniques we use to generate efficient SQL for relational databases to implement XQuery for these data sources.

These techniques work, and give us better performance than other XQuery implementations. Some XQuery implementations we have tested return an entire table for queries where we return only part of a single row. Others generate the same SQL regardless of the database involved, a strategy which simply can not offer good performance. Some XQuery implementations rely on the least-common-denominator functionality of the least capable JDBC drivers, which limits performance significantly. Some XQuery implementations perform most XQuery functions in the XQuery engine instead of evaluating them in the database.

DataDirect is the leading vendor of database connectivity software, and we know how to measure performance and scalability. We have spent many man-years developing extensive XQuery performance test suites, and we run these suites regularly as part of our standard development cycle. When our support staff identifies interesting customer performance scenarios, these are added to our performance test suites. In our testing, we have been pleased to find that applications written using DataDirect XQuery™ generally perform better than equivalent applications written using SQL, JDBC, and an XML API. Not all XQuery implementations can offer that kind of performance.

In this paper, we will explain some of the techniques we have developed, showing the SQL that DataDirect XQuery generates for a number of specific XQueries. If you want to see the SQL generated for one of your queries, you can do that using the profiler provided by your database vendor. If you are comparing XQuery products, comparing the SQL they generate can help you understand how these products will perform. Most of the generated SQL shown in this paper is for Oracle 10gR2 — SQL generated for other databases may look significantly different.

Overview

Before we explore individual queries, let's take a high-level look at the techniques we use. Most of the specific techniques we show later are based on the following general principles:

  1. Minimize data retrieval.

    Moving data is expensive. In DataDirect XQuery, the SQL we generate is as selective as possible, retrieving only the data needed to create the results of a query. Some XQuery implementations we have tested return an entire table for queries where we return only part of a single row.

  2. Leverage the database.

    In DataDirect XQuery, operations that can be performed in SQL are pushed down into the database, where the relational query optimizer can leverage indexes and other structures. The performance gains this brings are particularly important for joins, Order By clauses, and SQL functions. This also reduces data retrieval, since data need not be retrieved for operations to be done in the database.

  3. Optimize for each database.

    Today's relational databases support significantly different dialects of SQL, and even when two databases support the same operation, their performance may be quite different. Any given database has enough functionality to support XQuery efficiently, but the constructs needed to do this are different for each database. Some XQuery implementations support only one database; others generate the same SQL regardless of the database involved, which results in poor performance. In contrast, DataDirect XQuery uses a different SQL adaptor for each database, generating SQL specifically optimized for that database, based on our extensive performance testing.

  4. Retrieve data efficiently.

    The underlying database drivers can significantly affect performance. As the leading vendor of JDBC, ODBC, and ADO.NET drivers, DataDirect knows how to make our drivers perform while providing good support for all major relational databases. And because we control both the XQuery implementation and the underlying JDBC drivers, we can add optimizations to the drivers to support our implementation. The added cost of XML construction, which can be considerable in some implementations, is negligible in ours.

  5. Support incremental evaluation.

    In many applications, results are returned to the user as soon as they are available, displaying the first results well before the entire query has been performed. Many XML applications are based on streaming architectures. In DataDirect XQuery, we use lazy evaluation so that streaming APIs can retrieve data as soon as it is available. As data is needed, we retrieve it incrementally from JDBC result sets. Because there is no need to have the entire result in memory at one time, very large documents can be created.

  6. Optimize for XML hierarchies.

    Because XML construction is hierarchical, DataDirect XQuery uses SQL algorithms that optimize retrieving data for building hierarchies. For instance, we make extensive use of merge-joins when translating XQuery to SQL.

  7. Give the programmer the last word.

    Every SQL programmer knows that occasionally you need to use hints to get optimal performance for a specific query. This is also true in XQuery, so DataDirect XQuery allows programmers to influence the SQL we generate. This can significantly improve performance in some cases.

Next

Selecting Data

Copyright © 1993 - 2008. Progress Software Corporation. All rights reserved. | N. America: 800 876 3101 | World: +44 (0) 1753 218 930