Data Transfer Strategies

Transferring data between XML documents and relational databases

1.0 Overview

In this paper we will discuss strategies for transferring data between XML documents and relational databases according to two mappings (a table-based mapping and an object-based mapping) commonly used to map DTDs to relational databases. Although the discussion largely focuses on the difference between using SAX- and DOM-based tools to transfer data, it also discusses a number of strategies for traversing both the XML and database hierarchies and the tradeoffs among them.

IMPORTANT: The table-based mapping and the object-based mapping are not discussed in this paper. However, you must understand them before reading this paper. For more information, see Mapping DTDs to Databases.

WARNING: This is not the best paper I have ever written, largely due to lack of time to research all of the issues involved. If you find factual errors -- I'm sure there are more than what have already been found -- please contact me. Thanks.

2.0 Table-based mapping

The table-based mapping views an XML document as a serialized table. That is, the structure of the document must be as follows, where the outermost (<Tables>) element is needed only if the document contains more than one table:

   <Tables>
      <Table>
         <Row>
            <Column_1>...</Column_1>
            ...
            <Column_n>...</Column_n>
         </Row>
         <Row>
            <Column_1>...</Column_1>
            ...
            <Column_n>...</Column_n>
         </Row>
         ...
      </Table>
   </Tables>

The table-based mapping has the advantage of simplicity, which makes it useful as a model for writing data transfer tools. However, because the model applies only to a limited set of XML documents, the tools are of limited use. They are most useful for using XML to transfer data between relational databases or for providing single-table access to the database.

For a complete discussion of the table-based mapping, see see Mapping DTDs to Databases".

2.1 Transferring data from XML to the database

The code to transfer data from XML to the database follows a common pattern, regardless of whether it uses SAX or DOM:

Table element start: prepare an INSERT statement
Row element start: clear INSERT statement parameters
Column elements: buffer PCDATA and set INSERT statement parameters
Row element end: execute INSERT statement
Table element end: close INSERT statement

The code does not make any assumptions about the names of the tags. In fact, it uses the name of the table-level tag to build the INSERT statement and the names of the column-level tags to identify parameters in the INSERT statement. Thus, these names could correspond exactly to the names in the database or could be mapped to names in the database using a configuration file.

Here is the code using SAX for a document containing a single table:

   int state = UNKNOWN;
   PreparedStatement stmt;
   StringBuffer data;
   
   public void startElement(String uri, String name, String qName,
                            Attributes attr) {
      if (state == UNKNOWN) {
         stmt = getInsertStmt(name);
         state = TABLE;         
      }
      else if (state == TABLE) {
         state = ROW;
         stmt.clearParameters();
      } else if (state == ROW) {
         state = COLUMN;
         data = new StringBuffer();
      } else { // if (state == COLUMN)
         throw new SAXException("XML document nested too deep.");
      }
   }
   
   public void characters (char[] chars, int start, int length) {
      if (state == COLUMN)
         data.append(chars, start, length);
   }
   
   public void endElement(String uri, String name, String qName) {
      if (state == TABLE) {
         stmt.close();
         state = UNKNOWN;
      }
      else if (state == ROW) {
         stmt.executeUpdate();
         state = TABLE;
      } else if (state == COLUMN) {
         setParameter(stmt, name, data.toString());
         state = ROW;
      } else { // if (state == UNKNOWN)
         throw new SAXException("Invalid program state.");
      }
   }

And here is the code using DOM for a document containing a single table:

   void processTable(Document doc) {
      Element table = doc.getDocumentElement();
      PreparedStatement stmt = getInsertStmt(table.getNodeName());
      Element row = (Element)table.getFirstChild();
      while (row != null) {
         stmt.clearParameters();
         processRow(stmt, row);
         row = row.getNextSibling();
      }
   }
   
   void processRow(PreparedStatement stmt, Element row) {
      Element column = (Element)row.getFirstChild();
      while (column != null) {
         processColumn(stmt, column);
         column = column.getNextSibling();
      }
      stmt.executeUpdate();
   }
   
   void processColumn(PreparedStatement stmt, Element column) {
      String data = getTextChildren(column);
      setParameter(stmt, column.getNodeName(), data);
   }

As you can see, the code is functionally identical, the only difference being the use of SAX or DOM.

2.2 Transferring data from the database to XML

The code to transfer data from the database to XML also follows a common pattern, regardless of whether it uses SAX or DOM:

Open a result set over a table
Create a table element
For each row in the result set, create a row element
For each column in a row, create a column element and use the column's data as that element's PCDATA
Close the result set

Again, the code does not make any assumptions about the names of the tags, so these could be mapped from names in the database using a configuration file.

Here is the code using SAX for a document containing a single table:

   void processTable(String tableName, ContentHandler ch) {
      // Get result set
      ResultSet rs = conn.executeQuery("SELECT * FROM " + tableName);
      String tableTag = getTableTag(tableName);
      String rowTag = getRowTag();
      int numColumns = getColumnCount(rs);
      String[] columnTags = getColumnTags(rs);
   
      // Start document
      Attributes attrs = new Attributes();
      ch.startDocument();
      ch.startElement(null, tableTag, null, attrs);
   
      // Process rows and columns
      while (rs.next()) {
         ch.startElement(null, rowTag, null, attrs);
         for (int i = 1; i <= numColumns; i++) {
            ch.startElement(null, columnTags[i - 1], null, attrs);
            String data = rs.getString(i);
            ch.characters(data, 0, data.length());
            ch.endElement(null, columnTags[i - 1], null);
         }
         ch.endElement(null, rowTag, null);
      }
   
      // End document and close result set
      ch.endElement(null, tableTag, null);
      ch.endDocument();
      rs.close();
   }

And here is the code using DOM for a document containing a single table:

   Document processTable(String tableName) {
      Element table, row, column;
      Text data;
   
      // Get result set
      ResultSet rs = conn.executeQuery("SELECT * FROM " + tableName);
      String tableTag = getTableTag(tableName);
      String rowTag = getRowTag();
      int numColumns = getColumnCount(rs);
      String[] columnTags = getColumnTags(rs);
   
      // Create Document and root (table) node
      Document doc = getNewDocument();
      table = doc.createElement(tableTag);
      doc.appendChild(table);
   
      // Process rows and columns
      while (rs.next()) {
         row = doc.createElement(rowTag);
         table.appendChild(row);
         for (int i = 1; i <= numColumns; i++) {
            column = doc.createElement(columnTags[i - 1]);
            row.appendChild(column);
            data = doc.createTextNode(rs.getString(i));
            column.appendChild(data);
         }
      }
   
      // Close result set and return the Document
      rs.close();
      return doc;
   }

2.3 SAX vs. DOM

As can be seen above, the code to transfer data between an XML document and a relational database according to the table-based mapping is essentially independent of whether SAX or DOM is used. This is because the table-based mapping allows the document to be processed in a single pass in document order, which SAX essentially requires. The ability of DOM to randomly access the document or visit nodes multiple times is never used.

Because of this, any differences between the tools are essentially differences between SAX and DOM themselves, of which two are important. First, the SAX-based code uses much less memory. This is because it buffers only one row of data at a time, while the DOM-based code buffers the entire document. Second, the SAX-based code is faster because it doesn't have to spend time building a DOM tree.

An important consequence of the amount of memory used by each tool is that the SAX-based code is scalable to arbitrarily large documents while the DOM-based code is not. Therefore, there is no reason to use DOM-based code unless the application calling the code happens to use DOM trees.

As a result of its scalability, the table-based model combined with SAX is a useful strategy for transferring data from one database to another identical database. In this case, an XML document contains data from multiple tables, ordering these tables in the document so that primary key rows are always inserted before foreign key rows. (See section 4, Referential integrity for why this is necessary.) This strategy was developed by infoShark for its XMLShark product.

3.0 Object-based mapping

The obvious disadvantage of the table-based mapping is that it applies only to a limited set of XML documents. That is, while it works perfectly for a single table, it doesn't exploit the ability of XML to represent hierarchies of data (of the kind found in relational databases in hierarchies of tables) as nested data.

To solve these problems, we use a different mapping: the object-based mapping. This views the XML document as a tree of data-specific objects (not the DOM), and then uses an object-relational mapping to map these objects to the database. For example, the following XML document:

   <SalesOrder>
      <Number>1234</Number>
      <Customer>Gallagher Industries</Customer>
      <Date>29.10.00</Date>
      <Line Number="1">
         <Part>A-10</Part>
         <Quantity>12</Quantity>
         <Price>10.95</Price>
      </Line>
      <Line Number="2">
         <Part>B-43</Part>
         <Quantity>600</Quantity>
         <Price>3.99</Price>
      </Line>
   </SalesOrder>

can be mapped to these objects:

      object SalesOrder {
         number = 1234;
         customer = "Gallagher Industries";
         date = 29.10.00;
         lines = {ptrs to Line objects};
      }           /   \
                 /     \
                /       \
   object Line {       object Line {
      number = 1;         number = 2;
      part = "A-10";      part = "B-43";
      quantity = 12;      quantity = 600;
      price = 10.95;      price = 3.95;
   }                   }

and then to rows in these tables, which are linked through a primary key / foreign key relationship:

   SaleOrders
   ----------
   Number   Customer                Date
   ------   --------------------    --------
   1234     Gallagher Industries    29.10.00
   ...      ...                     ...
   ...      ...                     ...
   
   
   Lines
   -----
   SONumber   Line   Part   Quantity   Price
   --------   ----   ----   --------   -----
   1234       1      A-10   12         10.95
   1234       2      B-43   600        3.99
   ...        ...    ...    ...        ...

For a complete discussion of the object-based mapping, see see Mapping DTDs to Databases". Although you do not have to understand everything in that presentation, you need to thoroughly understand the following points in order to understand the rest of this paper:

The basic mapping from XML to objects to tables and columns. This includes the mapping from complex element types (those with element content) to classes and then to tables, and the mapping from simple element types (those with PCDATA-only content) to properties and then to columns. Especially important is understanding that each complex element in the document is represented by a row in a class table.
The use of primary key / foreign key links to join the tables corresponding to nested complex elements. Especially important is understanding that the primary key in the link can occur in the table of the parent or child element.
The use of order columns to store information about the order of sibling elements. Although the other paper emphasizes the use of order columns with mixed content, they are often used to round-trip documents with element content, although in some cases the order information can be stored in the mapping itself. Especially important is understanding that all of the order columns for a set of siblings share a single order space, and that these order columns occur in the parent table (for simple element types) as well as in one or more child tables (for complex element types).

The latter two points are at the root of many of the problems discussed later in this paper. (This paper does not consider the mapping from attributes to properties to columns, since attributes can usually be processed in a manner similar to simple element types. It also does not discuss the mapping of multiply-occurring elements or PCDATA to property tables, since the algorithms to process these are similar to those used to process class tables.)

4.0 Referential integrity

Before discussing how to transfer data from XML documents to the database according to the object-based mapping, we need to discuss referential integrity. Referential integrity refers to the validity of foreign keys -- that is, whether foreign keys point to (refer to) valid primary keys. Referential integrity is a necessary part of maintaining a consistent database state: It does users no good if a sales order contains a customer number and there is no corresponding customer record. Shipping won't know where to send the items purchased and Accounting won't know where to send the invoice.

For example, in the following tables, SalesOrder.SONum, Customers.CustNum, and Parts.PartNum are primary keys and Lines.SONum, SalesOrders.CustNum, and Lines.PartNum are foreign keys. All of the foreign key references are valid except those in the last row in the Lines table. This refers to a sales order (456) and a part (GHI) that don't exist.

SaleOrders                            Customers
----------                            ---------
SONum  Date      CustNum              CustNum  Name
-----  --------  -------              -------  ----------
123    29.10.00  007                  007      Bond, Inc.

Lines                                 Parts
-----                                 -----
SONum  Line  PartNum  Quantity        PartNum  Price
-----  ----  -------  --------        -------  -----
123     1    ABC      12              ABC      12.95
123     2    DEF      600             DEF       5.95
456     1    GHI      10

To ensure referential integrity is preserved -- that is, that foreign keys always point to valid primary keys -- a row containing a primary key must be inserted before a row containing a foreign key that points to that primary key. (This is not strictly true; for a discussion, see the end of the section.) For example, in the tables above, this means that a row must be inserted into Customers before a row can be inserted into SalesOrders, a row must be inserted into SalesOrders before a row can be inserted into Lines, and a row must be inserted into Parts before a row can be inserted into Lines. In other words, one possible insert sequence for the above set of tables is: Customers, SalesOrders, Parts, Lines.

Referential integrity and the insert order it imposes are relevant to transferring data from XML documents to the database because it affects the order in which the data in an XML document can be processed. This is a consequence of the fact that the primary key joining the tables of two nested elements can be in the table of the parent element or the child element. Thus, it may be necessary to insert the data for the parent element before the data for the child element and vice versa.

To see what this means, let's look at a few examples. In our first example, the parent element (Department) is stored in the Departments table and the child element (Employee) is stored in the Employees table. The key joining these (DeptNum) is in the Departments table. Thus, to process this document, we first insert a row from the parent element (Department) into the Departments table, then insert three rows from child elements (Employee) into the Employees table.

   <Department>
      <DeptNum>123</DeptNum>
      <DeptName>Sales</DeptName>
      <Employee>
         <Number>143</Number>
         <Name>Raul Lopez</Name>
      </Employee>
      <Employee>
         <Number>687</Number>
         <Name>John Smith</Name>
      </Employee>
      <Employee>
         <Number>947</Number>
         <Name>Ming Chu</Name>
      </Employee>
   </Department>

In our second example, the situation is reversed. The primary key (DeptNum) is in the table of the child element (Department). Thus, we must insert a row into the table of the child element before inserting a row into the table of the parent element.

   <Employee>
      <Number>143</Number>
      <Name>Raul Lopez</Name>
      <Department>
         <DeptNum>123</DeptNum>
         <DeptName>Sales</DeptName>
      </Department
   </Employee>

Our third example is more complex and has primary keys in the tables of both parent and child elements. It uses the sales order tables shown previously. The insert order is as was stated earlier. That is, we first insert a row into the Customers table, then into the SalesOrders table, then into the Parts table, then into the Lines table, and so on.

   <SalesOrder>
      <Number>123</Number>
      <Date>10/29/00</Date>
      <Customer>
         <CustNum>007</CustNum>
         <Name>Bond, Inc.</Name>
      </Customer>
      <Line>
         <LineNum>1</LineNum>
         <Quantity>3</Quantity>
         <Part>
            <PartNum>ABC</PartNum>
            <Price>12.95</Price>
         </Part>
      </Line>
      <Line>
         ...
      </Line>
   </SalesOrder>

Whether referential integrity affects insert order actually depends on where transactional lines are drawn. Within a single transaction, rows can be inserted in any order (depending on how various options are set in the database). Thus, if you transfer all of the data from an XML document to the database in a single transaction, referential integrity is not an issue with respect to insert order.

The problem with this is that it might not be a reasonable thing to do. For example, suppose you have an XML document of several megabytes containing thousands of sales orders. The problem with transferring all of the data from this document in a single transaction is that an error in the data for one sales order, such as an invalid date, causes the entire transaction to be aborted, thus rolling back the insertion of other (unrelated) sales orders. Such a large transaction may (I'm not sure) also unnecessarily tie up database resources.

The alternatives are to: (a) insert one row at a time and commit after each row, or (b) insert one fragment (such as one sales order) at a time and commit after each fragment. Option (a) requires you insert primary key rows before inserting foreign key rows. Option (b) requires your software to know what constitutes a fragment. This may be simple in the case of software designed to transfer data from documents with a known structure, but appears to be non-trivial to determine (or for users to specify) in the generic case.

In the rest of this paper, we will assume that the software chooses option (a).

5.0 Transferring data from XML to the database

From the preceding discussion about referential integrity, the problem we encounter in transferring data from XML to the database should be obvious: If we traverse the document in document order, we have to buffer data until it can be inserted into the database. If we traverse the document in some other order, we have to buffer the document so it can be accessed in this other order. In contrast to processing the document according to the table-based model, the ability of DOM to provide random access and allow nodes to be revisited may be of some use here.

5.1 SAX-based transfer

SAX reads a document a single pass, in depth-first, width-second order. As can be seen from the above examples, this means that one or more rows of data will have to be buffered while processing documents. How many rows must be buffered depends on how the document is mapped to the database and the intelligence of the transfer tool. This is most easily seen from looking at a few best- and worst-case examples.

We start with a worst-case example. In this case, the primary key is always in the table of the child element. Thus, as we read through the document, we must first buffer the employee data, then the department data, and finally the company data. When we reach the end of the Company element, we can insert a row in the Companies table, then we can insert a row in the Departments table, and finally we can insert a row in the Employees table. Notice that in this case, we buffered the data for the entire document before inserting any of it.

   <Employee>
      <Number>143</Number>
      <Name>Raul Lopez</Name>
      <Department>
         <DeptNum>123</DeptNum>
         <DeptName>Sales</DeptName>
         <Company>
            <CompName>Bond, Inc.</CompName>
            <Address>123 Main St.</Address>
         </Company>
      </Department>
   </Employee>

Now let's look at a best case example. In this case, the primary key is always in the table of the parent element. This means that we can insert rows as we encounter them. That is, after reading the DeptNum and Name elements, we can insert a row into the Departments table, and after reading the contents of each Employee element, we can insert a row into the Employees table. In this case, the software buffers only one row of data at a time.

   <Department>
      <DeptNum>123</Number>
      <DeptName>Sales</DeptName>
      <Employee>
         <Number>143</Number>
         <Name>Raul Lopez</Name>
      </Employee>
      <Employee>
         <Number>687</Number>
         <Name>John Smith</Name>
      </Employee>
   </Department>

Unfortunately, the best case is more fragile than it looks. The first problem is that it requires the software to recognize when it has all of the data for a given row. For example, the software must recognize that it has all of the department data when it receives the endElement event for the DeptName element, not when it receives the endElement event for the Department element. This requires the software to know the content model for the Department element and to be able to act on that information; whether most data transfer software is this sophisticated is debatable. It also requires that the document have a DTD and that the DTD be available; this limits the software's applicability.

The second problem is that the best case can be turned into the worst case by a simple transposition of elements. For example, the following document contains the same data as the best case document, but the DeptNum and DeptName elements occur after the last Employee element. This means that the department row can't be inserted until after all of the employee rows have been read. In consequence, all of the data in the document must be buffered before any of it can be inserted. (This problem is also an opportunity, as it means that some documents can be "tuned" to optimize transfer speed.)

   <Department>
      <Employee>
         <Number>143</Number>
         <Name>Raul Lopez</Name>
      </Employee>
      <Employee>
         <Number>687</Number>
         <Name>John Smith</Name>
      </Employee>
      <DeptNum>123</Number>
      <DeptName>Sales</DeptName>
   </Department>

In the next sections, we will see if using the DOM helps us avoid any of these problems.

5.2 DOM-based transfer

It is often thought that the DOM provides random access to an XML document, allowing it to be read in any order. In fact, this is somewhat of a misnomer. Unless a document is indexed, applications must search linearly through the document -- starting each set of siblings from either the first or last sibling -- in order to reach a given node. While this doesn't mean reading the entire document to reach a given node, it also isn't the same as jumping directly to a node or even doing a binary search for that node.

However, the DOM does allow nodes to be revisited easily. When an application visits a given node, it can buffer a pointer to that node and return directly to it. This is useful to us, since it means that data transfer software can read a node, decide whether it contains data to be transferred immediately and, if not, return at a later point in time.

We will now consider three different strategies for transferring data from a DOM tree to the database. These trade off the amount of memory used to buffer data with the number of times a given node must be visited.

5.2.1 Depth-first, width-second

The first strategy is to traverse the document in depth-first, width-second order -- that is, the same order used by SAX. While this strategy visits the least nodes -- it visits each node only once -- it also leads to the same buffering problems as in the SAX-based tool. That is, in the worst case, it means that the all of the data in the document must be buffered. Thus, this strategy offers no advantages over using SAX.

5.2.2 Minimal memory

Our second strategy minimizes the amount of memory used, buffering only one row of data at a time. It does this by traversing the children of a given parent three times. On the first pass, it processes all children that contain primary keys -- that is, complex elements in which the key used to link the tables of the parent and child elements occurs in the table of the child element. These are known as primary key elements.

On the second pass, it processes all children that contain data -- that is, simple elements. These are known as data elements. At the end of this pass, it inserts a row into the parent element's table. Notice that, unlike in the case of the SAX-based tool or the depth-first, width-second strategy, the software can easily recognize when it has all of the data for a given row. It knows this because it hits the end of a set of siblings. That is, Node.getNextSibling() returns null.

On the third pass, the software processes all children that contain foreign keys -- that is, complex elements in which the key used to link the tables of the parent and child elements occurs in the table of the parent element. These are known as foreign key elements.

This strategy is best understood by example. The following diagram represents a DOM tree, with Element nodes labeled by their element tag name and Text nodes labeled as Text. The numbers beneath each node represent the order in which nodes are visited. Notice that element nodes are visited three times -- once for each pass -- and text nodes are visited once, when processing data elements.

                            SalesOrder
                       _______  1
        ______________/ _____/ / \ \____
       /     __________/ _____/   \____ \______________________
      /     /           /              \                       \
Number  Date    Customer                Line                    Line
2,15,22 3,17,23 4,19,24               13,20,25                14,21,45
  |      |        /   \              /    |   \_____            /   |
  |      |       /     \            /     |         \          /    |
 Text   Text  CustNum  Name  LineNum  Quantity       Part     ...  ...
  16     18   5,7,11  6,9,12 26,37,42 27,39,43      28,41,44
                 |      |      |         |            / \
                 |      |      |         |           /   \
                Text   Text   Text      Text    PartNum  Price
                 8      10     38        40    29,31,35  30,33,36
                                                   |       |
                                                   |       |
                                                  Text    Text
                                                   32      34

Thus, on the first pass through the children of SalesOrder, the Number and Date elements are ignored because they are data elements -- that is, they contain PCDATA mapped to columns in the SalesOrders table. The Customer element is processed because it is a primary key element -- that is, the primary key used to link the SalesOrders and Customers tables is in the Customers table. And the two Line elements are ignored because they are foreign key elements -- that is, the foreign key used to link the SalesOrders and Lines tables is in the Lines table.

On the second pass, the data elements (Number and Date) are processed -- at this time, the Text nodes are visited -- and a row is inserted into the SalesOrders table. On the final pass, the foreign key elements (the Line elements) are processed.

The process is recursive. For example, when a Line element is processed, a row isn't immediately inserted into the Lines table. Instead, the first pass through the children of the Line element uncovers a primary key element (Part) and processes that element, inserting a row into the Parts table, before a row is inserted into the Lines table.

While the minimal memory method is more efficient than the depth-first, width-second method with respect to the amount of data buffered, it does require nodes to be visited three times as often. As it turns out, this number can be reduced without using too much additional memory.

5.2.3 Buffered pointers

Our third, and final, strategy reduces the number of times each node is visited. It is based on the fact that searching through an unindexed DOM tree is linear for any given set of siblings. That is, even if the software knows which elements are primary key elements, data elements, and foreign key elements, it can't go directly to them. Instead, it must start at the beginning of a set of siblings and search through them one at a time.

However, during this search, the software can buffer pointers to nodes it visits but doesn't process immediately. Thus, during the first pass through a set of siblings, in which the software is looking for primary key elements, the software can buffer pointers to the data and foreign key elements it finds. This means the second and third passes of the minimal memory strategy can be reduced to visiting only the nodes that need to be processed on those passes.

This reduces the total number of nodes visited by at least one third and up to two thirds. (In the worst case, the document consists only of data and foreign key elements. These are visited twice -- a one third reduction. In the best case, the document consists only of primary key elements. These are visited once -- a two-thirds reduction.)

To see how this works, lets go back to our diagram. In this case, the first number underneath an element represents the first time it is visited and the second number (if there is one) represents the time it is visited as a data or foreign key node.

                            SalesOrder
                       _______  1
        ______________/ _____/ / \ \____
       /     __________/ _____/   \____ \______________________
      /     /           /              \                       \
Number  Date    Customer                Line                    Line
 2,13   3,15       4                    11,17                  12,31
  |      |        /   \              /    |   \_____          /   |
  |      |       /     \            /     |         \        /    |
 Text   Text  CustNum  Name  LineNum  Quantity       Part   ...  ...
  14     16     5,7    6,9    18,27    19,29          20
                 |      |      |         |            / \
                 |      |      |         |           /   \
                Text   Text   Text      Text    PartNum  Price
                 8      10     28        30      21,23   22,25
                                                   |       |
                                                   |       |
                                                  Text    Text
                                                   24      26

On the first pass through the children of the SalesOrder element, all children are visited. The software buffers pointers to the Number and Data elements (data elements), processes the Customer element (a primary key element), and buffers pointers to the Line elements (foreign key elements). On the second pass, it returns to the Number and Data elements, processes them, and inserts a row into the SalesOrders table -- the Customer and Line elements are not revisited. On the final pass, the software returns only to the Line elements and processes them. Like the minimal memory strategy, the process is recursive.

The strategy buffers only one row data at a time, like the minimal memory strategy. However, it also buffers pointers in exchange for reducing the number of nodes visited. In the best case, no pointers are buffered; this occurs when the document contains only primary key elements. In the worst case, the number of pointers buffered is the number of data elements in the document; this occurs when the document contains no foreign key elements.

5.3 SAX vs. DOM

Assuming that pointers are significantly smaller than individual pieces of data, the buffered pointers strategy provides a reasonable tradeoff between memory and speed when using a DOM-based tool. However, it still can't ever beat the SAX-based tool for memory use and speed. The reason for this is something that should have occurred to alert readers: a DOM tree contains all of the data in a document.

In the worst case, the SAX-based tool buffers all of the data in a document. In the best case, all of the DOM-based tools buffer a DOM tree plus one row of data. Because the DOM tree contains all of the data, the DOM-based tools necessarily use more memory. How much more memory depends on the average size of the data per node. A DOM level 1 node contains 9 pointers to other nodes or, at 4 bytes per pointer, 36 bytes of pointers per node. It also contains other information such as the name and type of the node. If the data is large with respect to the node overhead -- for example, 1000 bytes per node -- then the size of the DOM tree is roughly the size of the data in that tree. On the other hand, if the data is small with respect to the node overhead -- for example, 10 bytes per node, which is more common in data-centric applications -- then the size of the DOM tree can be several times the size of the data.

With respect to speed, the SAX-based tool visits each node only once -- a speed record that the DOM-based tools can equal in some cases but never beat. Furthermore, the SAX-based tool does not need to spend time building a DOM tree and also traverses the document more quickly. This is because traversing the document is part of parsing and does not require any extra method calls.

Thus, when using the object-based mapping to transfer data from XML documents to the database, SAX-based tools are always faster and use less memory than DOM-based tools. However, the difference is not always as great as in the case of tools that use the table-base mapping and, more importantly, SAX-based tools are not arbitrarily scalable. This is because, in the worst case, they buffer all of the data in the document.

Whether this is a problem is debatable. The reason is that the object-based mapping, as was seen in the presentation on mapping DTDs to databases, is most useful for data-centric documents and data-centric documents tend to be relatively small. Thus, the scalability problems are rarely encountered. Furthermore, most large data-centric documents tend be serializations of entire databases and the structure of these documents is such that they are almost never buffered in their entirety.

In particular, documents that serialize a database essentially consist of multiple, data-centric subdocuments stored under a single root element. Each subdocument represents the hierarchy of data associated with a single row in the top-level table and is therefore, like any data-centric document, relatively small. The root element exists only to gather the subdocuments into a single document and is ignored, so processing these documents requires the same amount of memory as if each subdocument was processed separately.

Finally, it is worth noting that the XML documents used by the SAX-based tool can be "tuned" -- that is, they can be designed to minimize the amount of data that needs to be buffered. To do this, each set of siblings must be arranged in the following order: primary key elements, data elements, foreign key elements. When this is done, the SAX-based tool buffers only one row of data at a time and scales to arbitrarily large documents.

6.0 Transferring data from the database to XML

The data for a given XML document can be spread across multiple tables in the database hierarchy. For example, our sales order document is spread across four tables:

                SaleOrders
                ----------
                SONum  Date      CustNum
                -----  --------  -------
                123    29.10.00  007
                   /                \
                  /                  \
   Lines         /                   Customers
   -----        /                     ---------
   SONum  Line  PartNum  Quantity    CustNum  Name
   -----  ----  -------  --------    -------  ----------
   123     1    ABC      12          007      Bond, Inc.
   123     2    DEF      600
               |
               |
            Parts
            -----
            PartNum  Price
            -------  -----
            ABC      12.95
            DEF       5.95

What is important to note about this is that sibling elements can come from multiple tables. For example, the children of the SalesOrder element -- SONum, Date, Customer, and Line -- come from three different tables -- SalesOrders, Customers, and Lines. The first two elements (SONum and Date) represent data from columns in the SalesOrder table and are known as column elements, while the latter two elements (Customer and Line) represent entire rows in the Customers and Lines tables and are known as row elements.

Unlike the case when inserting data into the database, the relative position of the primary and foreign keys are unimportant when retrieving data from the database. This is because reading data has no effect on referential integrity. The data is simply read from the top down, without regard to whether the key used to link a parent to its child is a primary or foreign key.

6.1 Data transfer problems

There are two main problems in transferring data from the database to an XML document. The first is simply how to traverse the hierarchy efficiently. The strategies discussed below trade off between the result set size, the amount of client memory used, and the number of SELECT statements executed. Except in one case, they are independent of whether SAX or DOM is used. That is, the software can generate SAX events or build a DOM tree. Because of this, the software should always generate SAX events unless the application happens to need a DOM tree. As we have already seen, this is always faster and takes less memory.

The second problem in retrieving the data is sibling order. If sibling order is not stored in the database, it is sufficient to traverse the hierarchy in hierarchical order and create elements in the order found. However, if sibling order is stored in the database (as opposed to in the mapping), elements must be placed in their correct position in their parent. This can be done in one of two ways. The first is to traverse the hierarchy in document (hierarchical + sibling) order and create elements in document order. This is difficult because: (a) the next element can come from the parent table (column elements) or from any of the child tables (row elements), and (b) the order values used to determine which element occurs next are stored parallel to the data. That is, they are stored both in the parent table and in the child tables. This makes it difficult to sort the data.

The second way to place elements in their correct position is to traverse the database in hierarchical order and create an ordered DOM tree. An ordered DOM tree is one in which order values are stored with DOM nodes. This allows the software to compare the order value of a given element with the order values of its sibling elements and place the new element in its correct position. An ordered DOM tree can be implemented by extending the objects in a particular DOM implementation or maintaining a parallel tree with order values.

6.2 Parallel tables

Our first strategy traverses the database hierarchy in document order. To do this, it starts by opening a result set over the table for the root element. For each row in this table, it creates a row element, then opens result sets over all child tables in parallel. The result sets are ordered by their order columns -- that is, the columns that provide the order row elements for the child table occur in their parent.

The software then reads the child result sets in parallel. That is, it checks the order values in the current row in each result set and selects the row with the lowest value. It then compares this to the order values for the data columns in the parent row. If a data column has the lowest order value, the software creates a column element for it and repeats the process with the remaining data columns. If a child row has the lowest order value, the software processes the row recursively.

For example, suppose we have the following table hierarchy:

   Table A
   a1  a1Order  a2  a2Order  a3  a3Order
   --  -------  --  -------  --  -------
   a1     6     a2     2     a3     3

   Table B
   a1  a1Order  b1   b1Order  b2   b2Order  b3   b3Order
   --  -------  ---  -------  ---  -------  ---  -------
   a1     1     b1     1      b2      2     b3      3
   a1     5     b11    1      b22     2     b33     3

   Table C
   a1  a1Order  c1  c1Order  c2  c2Order
   --  -------  --  -------  --  -------
   a1     4     c1     1     c2     2

which represents the following XML document:

   <A>
      <B>
         <b1>b1</b1>
         <b2>b2</b2>
         <b3>b3</b3>
      </B>
      <a2>a2</a2>
      <a3>a3</a3>
      <C>
         <c1>c1</c1>
         </c2>c2</c2>
      </C>
      <B>
         <b1>b11</b1>
         <b2>b22</b2>
         <b3>b33</b3>
      </B>
      <a1>a1</a1>
   </A>

The software first opens a result set over table A and retrieves a row. It creates an element for this row (<A>) and then uses the primary key (a1) to open result sets over tables B and C. These result sets are sorted on the a1Order column, which gives the order of row elements for the tables in the <A> element.

The software then compares the order values in the first row in each result set and finds that the first row in table B has the lowest value (1). It compares this with the order columns for the data columns in table A (a1Order, a2Order, and a3Order) and finds that the first row in table B has the lowest value. It therefore creates a row element (<B>) and processes the row recursively, creating column elements for the data columns in table B (b1, b2, and b3).

It then repeats the process, this time using the second row in table B and the first row in table C. It finds that the data column A.a2 has the lowest order value, so it creates an element for this column (<a2>). Repeating the process again, it creates (in order) an element for column a3, an element for the first row in table C, an element for the second row in table B, and an element for column a1.

The parallel tables method has a single advantage: it can traverse the database hierarchy in document order. Because of this, it can generate SAX events for the document instead of building a DOM tree. Unfortunately, this advantage comes at a hefty price: a large number of result sets are open at the same time. In particular, for a given set of sibling tables, there are result sets open over these tables, plus all sibling tables at higher levels in the path back to the root table.

For example, suppose we have a set of tables in the following hierarchy:

                 Table A
               /    |    \
        Table B  Table C  Table D
         /  \                |
   Table E  Table F       Table G
                           /  \
                     Table H   Table I

When processing table A, the software opens result sets over tables B, C, and D. When processing table D, the software opens a result set over table G. And when processing table G, the software opens result sets over tables H and I. This results in the following open result sets:

             Result A
           /    |     \
   Result B  Result C  Result D
                          |
                       Result G
                        /     \
                  Result H   Result I

In addition to open result sets, the parallel tables strategy also must execute a very large number of SELECT statements: the sum of the number of rows returned from all non-leaf tables. And it runs the risk of deadlocks if another process is deleting a row in a child table while it is holding that row's parent in a result set over the parent table.

In short, the parallel tables method is the worst of both worlds: lots of database resources used and lots of SELECT statements executed. As a final insult, there is no way to optimize the process when sibling order is not stored in the database.

6.3 Individual tables

Our second strategy reduces the number of results set open at one time. To do this, instead of opening result sets over all child tables at once, it opens them individually. If sibling order is not stored in the database, the software processes results and creates elements in the order found. That is, for a given row in a parent table, it creates a row element and then creates elements for the data columns in the row. It then opens result sets over the child tables one at a time and processes each in turn.

The advantage of this process is that the software can issue SAX events in the case where order is not stored in the database with far fewer open result sets than is the case with the parallel tables method. In particular, the number of open result sets equals the current depth, as can be seen in the following diagram, which shows the state when processing table I:

   Result A
            \
             Result D
                |
             Result G
                    \
                   Result I

If sibling order is stored in the database, a different strategy is needed. In this case, there is no guarantee that results will be returned in document order. Because of this, the software needs to build an ordered DOM tree (as was described above) whenever it encounters rows or columns representing elements that occur after the current position in the document. The DOM tree can be flushed to SAX events whenever the software fills in the missing elements. Note that for such flushing to be possible, the sequence of order values must have a known start, a fixed increment, and no gaps. In the worst case -- that is, if the order sequence doesn't meet these requirements or the first element is returned last -- this method buffers the entire document as a DOM tree.

The individual tables method is generally better than the parallel tables method as it uses significantly fewer database resources. When sibling order is not stored in the database, this is a clear win. However, when sibling order is stored in the database, it comes at the cost of having, in the worst case, to build a DOM tree for the entire document. In addition, it requires the same number of SELECT statements as the parallel tables method and has the same potential for deadlocks. Finally, it doesn't scale well with respect to local memory when sibling order is stored in the database and doesn't scale well with respect the number of SELECT statements executed regardless of sibling order.

6.4 Universal table I

Our third and fourth strategies both create a single result set -- a universal table -- containing all of the data in the document. This result set is large and non-normal but generally can be materialized one row at a time. This is because it is constructed using primary key / foreign key pairs, which are usually indexed, thus saving the database from having to materialize the entire result set in order to sort it. The other advantage of universal tables is that they substantially reduces the number of SELECT statements executed and, assuming the database engine is clever enough, eliminate the possibility of deadlocks.

There are actually two types of universal tables, where type I is simply created by joining all the relevant tables and type II uses UNIONs to join the tables.

WARNING: Because I have not implemented a type I universal table, I am not certain that the rest of this section is entirely correct. Corrections welcome.

To construct a type I universal table:

Create single SELECT statement with a select list containing all data columns in all relevant tables.
Include the join columns for all tables in both the select list and the join clause, so each row returned by the SELECT fully identifies its ancestors. This results in a (potentially) large amount of duplicate data.
Sort the SELECT statement in hierarchical or document order:
- To sort in hierarchical order, place all the join columns in the ORDER BY clause in hierarchical order. That is, each join column must appear to the right of all of its ancestral join columns; for a given set of join columns, there are usually multiple ORDER BY clauses that satisfy this requirement.
- It is not clear whether a type I universal table can be sorted in document order. At first glance, it does not appear to be possible, since it would require the database to impose a single sort order on multiple columns.

For example, the following SELECT statement is used to construct a type I universal table over the hierarchy of tables used to construct the sales order document:

   SELECT   SONum AS SONum, Date,
            Cust.CustNum, Name,
            Line, Quantity,
            Parts.PartNum, Price
   FROM     SalesOrders, Customers, Lines, Parts
   WHERE    SalesOrders.CustNum = Customers.CustNum AND
            SalesOrders.SONum = Lines.SONum AND
            Lines.PartNum = Parts.PartNum
   ORDER BY SONum, Line, PartNum, CustNum

The type I universal table for our sales order data is as follows:

   SONum Date     CustNum Name       Line Quantity PartNum Price
   ----- -------- ------- ---------- ---- -------- ------- -----
   123   29.10.00  007    Bond, Inc.  1    12       ABC    12.95
   123   29.10.00  007    Bond, Inc.  2    600      DEF     5.95

A type I universal table can be thought of as multiple mini-result sets (one for each table) appended horizontally.

     SalesOrders  |    Customers       |     Lines     |     Parts
                  |                    |               |
   SONum Date     | CustNum Name       | Line Quantity | PartNum Price
   ----- -------- | ------- ---------- | ---- -------- | ------- -----
   123   29.10.00 |  007    Bond, Inc. |  1    12      |  ABC    12.95
   123   29.10.00 |  007    Bond, Inc. |  2    600     |  DEF     5.95

Each unique row in each of these mini-result sets represents a single row in the table hierarchy and thus a single element with element content. Each column in each mini-result set represents an element with PCDATA content nested in its row element. The nesting of row elements must be determined from external mapping information and, assuming the select list follows the same order as the ORDER BY clause, reads roughly from left to right. However, since a hierarchy is flattened into a single line (row), multiple orders can represent the same hierarchy.

For example, in the above result set, the Customer and Line elements (corresponding to rows in the Customers and Lines mini-result sets) are nested in the SalesOrder element, while the Part element (corresponding to rows in the Parts mini-result set) is nested in the Line element.

To process a type I universal table, the software reads the rows in order, determines which of the join columns has changed, and creates elements accordingly. For example, for the first row, it notes that the SONum column has changed (it is initialized to NULL) and creates a SalesOrder element for the row. It then creates two child elements for the columns in the SalesOrders table (SONum and Date):

   <SalesOrder>
      <Number>123</Number>
      <Date>29.10.00</Date>

Note that the column names aren't required to match the element names -- they can be mapped separately.

Next, reading horizontally across the universal table, it notes that the CustNum element has also changed and creates an element for that row, as well as child elements for the columns in the Customers table (CustNum and Name). It nests the Customer element inside the SalesOrder element because it knows to do so from the mapping information. It also closes the Customer element because it knows from the mapping information that no other elements are nested inside the Customer element.

         <Customer>
            <CustNum>007</CustNum>
            <Name>Bond, Inc.</Name>
         </Customer>

Finally, it creates the Line and Part elements in similar fashion:

         <Line>
            <LineNum>1</LineNum>
            <Quantity>12</Quantity>
            <Part>
               <PartNum>ABC</PartNum>
               <Price>12.95</Price>
            </Part>
         </Line>

To process the second row, it notices that the join keys for the SalesOrders and Customers mini-result sets haven't changed, but the keys for Lines has, so it creates elements for the subhierarchy represented by the rows in the Lines and Parts mini-result sets:

         <Line>
            <LineNum>2</LineNum>
            <Quantity>600</Quantity>
            <Part>
               <PartNum>DEF</PartNum>
               <Price>5.95</Price>
            </Part>
         </Line>

It then processes the final "row", in which the join keys are NULL, telling it to close the SalesOrder element:

   </SalesOrder>

One drawback of the type I universal result set is that it appears to work only when the join keys are unique. Otherwise, the processing software cannot determine when join keys have changed and the rows are not sorted in the correct order. For example, suppose that the SalesOrders table had two rows with SONum equal to 123. The resulting type I universal table is as follows. Note that the processing software would not notice any change in join keys between the first and second rows or between the third and fourth rows, meaning that only one hierarchy of elements with SONum equal to 123 would be created.

   SONum Date     CustNum Name       Line Quantity PartNum Price
   ----- -------- ------- ---------- ---- -------- ------- -----
   123   29.10.00  007    Bond, Inc.  1    12       ABC    12.95
   123   29.10.00  007    Bond, Inc.  1    12       ABC    12.95
   123   29.10.00  007    Bond, Inc.  2    600      DEF     5.95
   123   29.10.00  007    Bond, Inc.  2    600      DEF     5.95

6.5 Universal table II

Our final strategy is to create second type of universal table, this time using UNION statements. The main advantage of the type II universal table appears to be that it uses less space, since it returns NULLs instead of duplicate data for non-join key columns. This results in somewhat faster processing.

The idea for type II universal tables comes from Microsoft SQL Server. The type II universal table discussed here has two slight variations on that found in SQL Server. First, it does not contain the information used to map the table to the XML document, which SQL Server stores as metadata and in two extra columns. Second, it can be constructed to use sibling order information, something which SQL Server does not support.

To construct a type II universal table:

Create SELECT statement over each table.
Include the join columns for all ancestor tables in both the select list and the join clause, so each row returned by the SELECT fully identifies its ancestors.
Join the SELECT statements with UNION ALL clauses. This requires each SELECT statement to contain all of the columns in all of the other SELECT statements and results in many NULLs in the result set.
Sort the unioned SELECT statements in hierarchical or document order:
- To sort in hierarchical order, place all the join columns in the ORDER BY clause in hierarchical order. That is, each join column must appear to the right of all of its ancestral join columns; for a given set of join columns, there are usually multiple ORDER BY clauses that satisfy this requirement.
- To sort in document order, add one order column per set of sibling tables and place these in the ORDER BY clause in hierarchical order. (Sorting in document order is actually rather complex. It also generally causes the entire result set to be materialized for sorting, since order columns generally aren't indexed. For complete details, see appendix A.)

For example, the following SELECT statements are used to construct a type II universal table over the hierarchy of tables used to construct the sales order document:

   SELECT SONum AS SONum, Date,
          NULL AS CustNum, NULL AS Name,
          NULL AS Line, NULL AS Quantity,
          NULL AS PartNum, NULL AS Price
   FROM SalesOrders

   UNION ALL

   SELECT SONum, NULL,
          Cust.CustNum, Name,
          NULL, NULL,
          NULL, NULL
   FROM SalesOrders, Customers
   WHERE SalesOrders.CustNum = Customers.CustNum

   UNION ALL

   SELECT SalesOrders.SONum, NULL,
          NULL, NULL,
          Line, Quantity,
          NULL, NULL
   FROM SalesOrders, Lines
   WHERE SalesOrders.SONum = Lines.SONum

   UNION ALL

   SELECT SalesOrder.SONum, NULL,
          NULL, NULL,
          Line, NULL,
          Parts.PartNum, Price
   FROM SalesOrders, Lines, Parts
   WHERE (SalesOrders.SONum = Lines.SONum)
     AND (Lines.PartNum = Parts.PartNum)

   ORDER BY SONum, Line, PartNum, CustNum

Each statement retrieves data from a different table, joining that data to all its ancestors. By also including these join elements in the ORDER BY clause, the rows can be sorted in hierarchical order. That is, each row element is guaranteed to be created in the correct parent element.

The type II universal table for our sales order data is as follows:

   SONum Date     CustNum Name       Line Quantity PartNum Price
   ----- -------- ------- ---------- ---- -------- ------- -----
   123   29.10.00  --      --         --   --       --      -- 
   123    --       007    Bond, Inc.  --   --       --      -- 
   123    --       --      --         1    12       --      -- 
   123    --       --      --         1    --       ABC    12.95
   123    --       --      --         2    600      --      -- 
   123    --       --      --         2    --       DEF     5.95

Each row in the universal table represents a single row in the table hierarchy in the database. The NULL columns in the row correspond to columns in other tables, and the non-NULL columns come either from the table itself or from the columns used to join the table back to the root table. For example, the last row represents the second row in the Parts table. It includes values from two columns the Parts table (PartNum and Price) and the join columns used to join it back to the SalesOrders table (SalesOrder.SONum and Lines.Line).

As an aside, type II universal tables must be sorted so that NULLs sort at the low end. This is because for any given row, the values of the join columns at all lower levels will be NULL. If NULLs sort at the high end, the table will sort in reverse hierarchical order. For example, consider the first row, which comes from the SalesOrder table. In this row, the values of CustNum, Line, and PartNum are NULL. If NULLs sorted at the high end, this row would sort to the end of the table, not the start.

To process a type II universal table, the software reads the rows in order, determines which of the join columns has changed, and, assuming sibling order is not stored in the database, creates elements accordingly. For example, for the first row, it notes that the SONum column has changed (it is initialized to NULL) and creates a SalesOrder element for the row. It then creates two child elements for the columns in the SalesOrders table (SONum and Date):

   <SalesOrder>
      <Number>123</Number>
      <Date>29.10.00</Date>

Note that the column names aren't required to match the element names -- they can be mapped separately.

In the second row, the join column from the Customers table (CustNum) has changed, so the software creates an element for that row. It then creates child elements for the columns in the Customers table (CustNum and Name):

         <Customer>
            <CustNum>007</CustNum>
            <Name>Bond, Inc.</Name>

In the third row, the CustNum column changes from non-NULL to NULL, telling the software to close the element before creating a new element for the row:

         </Customer>
         <Line>
            <LineNum>1</LineNum>
            <Quantity>12</Quantity>

Processing continues until all rows have been read, at which time the software closes the root element.

If the result set contains sibling order information, the software must proceed in slightly different fashion. It still creates a row element for each row encountered, but instead of immediately creating column elements for the data columns in the row, it caches the row and reads the next (child) row. It compares the order value in this row with the order values for the data columns in the parent row, then creates elements accordingly. When it reads a row at the same or higher level as the parent row, it creates elements for any remaining data columns in the parent row, discards the cached row, and starts the process over.

Like the type I universal table, the type II universal table does not appear to work correctly when the join keys are not unique. Although the processing software does know how many rows to create because each row in the type II universal table does represent a separate element, the rows are not sorted in the correct order.

6.5.1 Type II universal tables in SQL Server

SQL Server optimizes the processing of type II universal tables by assigning each SELECT statement a unique number, the Tag number. It then includes two extra columns in the table -- one for the Tag number of the row and one for the Tag number of the parent SELECT statement. This allows the processing software to be ignorant of the join columns. Instead, it checks if the Tag number has changed.

If it is the same, it closes the current element. If it has changed and the new parent Tag number is not equal to the child Tag number from the previous row, the software closes the current elements and all elements up to (but not including) the element represented by the new parent Tag number. If it has changed and the new parent Tag number equals the child Tag number from the previous row, the software does nothing. In all cases, the software then starts an element for the row and creates elements for the data columns in the row.

SQL Server also labels columns by their Tag number and element (or attribute) name, which makes it easy for the processing software to determine which data columns to create elements or attributes for and what names to use in doing this.

SQL Server does not support order columns and treats them like any other data column.

6.6 Comparison

So what is the best method for transferring data from the database to XML? First, we should note that it is largely independent of SAX and DOM. Except when the individual tables method is used and sibling order is stored in the database, all of the methods can either generate SAX events or build a DOM tree. Given the choice, there is little reason to use DOM -- it requires more memory and is slower.

That said, universal tables (either type I or type II) are the best choice when sibling order is not stored in the database. They execute far fewer SELECT statements than the other methods and the database engine can optimize retrieval because it, not the data transfer software, has control over the joins. Furthermore, assuming that the proper indexes are available, only one row of data is materialized at a time, allowing the method to scale well.

The main disadvantage of universal tables is that they don't appear to scale well when sibling order is stored in the database. This is because the ORDER BY clause in the SELECT statement will almost certainly cause the entire result set to be materialized. For very large documents, this could be a problem, although one redeeming feature is that the memory is on the server, not the client.

However, there appears to be little other choice, as neither of the other methods appears to scale well. The parallel tables method doesn't scale for deep or wide documents, and the individual tables method scales well only with respect to memory in the unordered case. Fortunately, as was mentioned earlier, the object-based mapping is best used with data-centric documents, which tend to be relatively small. Thus, scalability might not be an issue.

In an earlier version of this paper, I noted that a graduate student who is looking for a good project might consider implementing each of these methods and determining which scales best and why. It appears that this has been done, and then some. See:

Efficiently Publishing Relational Data as XML Documents
by J. Shanmugasundarm, E. Shekita, R. Barr, M. Carey,
B. Lindsay, H. Pirahesh, and B. Reinwald

This covers a superset of the strategies discussed in this paper, as well as experimental results describing the performance of each. The correlation of names is roughly "Individual Tables" = "Stored Procedures", "Universal Table I" = "Redundant Relation", "Universal Table II" = "(Un)sorted Outer Union". One very interesting (and surprising) result is that the Universal Table I performed so poorly the authors didn't even bother to include it in their performance graphs. (It performed even worse than the Individual Tables method.) This was due to the large amount of redundant data, which apparently slowed query execution significantly.

Thanks to Dongwon Lee of UCLA for the reference.

7.0 Conclusion

In this paper we have surveyed strategies for transferring data between XML documents and relational databases and have come to the following conclusions:

It is always possible to transfer data using SAX.
SAX-based tools are always faster than DOM-based tools because they: (a) don't have the overhead of instantiating a DOM tree, (b) visit the same or fewer number of nodes, and (c) have less overhead traversing the tree.
SAX-based tools always use less memory than DOM-based tools. However, the reduction in memory might be much less than might be expected. In particular, when transferring data from XML to the database, referential integrity can force the tool to buffer the data for the entire document. Similarly, when transferring data from the database to XML, sibling order can force the tool to buffer the data for the entire document in a result set or DOM tree.
Scalability can be a problem. Although this can be largely eliminated in some cases when transferring data from XML to the database by "tuning" the document, there appears to be no general solution when transferring data from the database to XML and sibling order is stored in the database. Fortunately, scalability might not be a major issue for data-centric applications.

A Sorting type II universal tables in document order

This appendix discusses how to sort type II universal tables in document order. While the algorithm presented here considers order values associated with row elements, it does not consider order values associated with column elements. These must be handled separately by the processing software in the manner described earlier.

Suppose we have a document based on the following tables:

   Table A
   --------
   a   aOrder

   Table B
   -------
   a   bOrder  b

   Table C
   -------
   a   cOrder  c

which represents the following table hierarchy:

     A
    / \
   B   C

Column A.a is the primary key of table A and is used to link table A to tables B and C. Column A.aOrder gives the order that each row in table A occurs in the root element (which is needed if more than one row is retrieved from table A). Columns B.bOrder and C.cOrder give the order in which the row elements for tables B and C occur in the row element for table A. Columns B.b and C.c are data columns for tables B and C. That is, they correspond to column elements within the row elements for tables B and C. For clarity in the universal tables that follow, the values of A.a, B.b, and C.c are strings identifying the table and row. For example, the value of B.b in the first row of table B is "B1" and the value of C.c in the third row of table C is "C3".

To sort rows from tables B and C together, we need to assign their order columns to a single result set column. For example, the following SELECT statements assign both B.bOrder and C.cOrder to the result set column bcOrder:

   SELECT a, aOrder, NULL AS bcOrder, NULL as b, NULL as c FROM A
     UNION ALL
   SELECT A.a, NULL, bOrder, b, NULL FROM A, B WHERE A.a = B.a
     UNION ALL
   SELECT A.a, NULL, cOrder, NULL, c FROM A, C WHERE A.a = C.a
   ORDER BY aOrder, bcOrder

To see why this sorts the results in document order (both hierarchical and sibling order), notice that a given sibling order value identifies a particular sibling within a set of siblings. Thus, a chain of order values identifies a unique node in the document -- each order value in the chain identifies a single sibling at each level, thus leading to a single node.

By assigning the order columns from all sibling tables to a single column in the result set, we are able to sort on this result set column, thereby sorting the rows from multiple sibling tables in sibling order. By placing the sibling order columns in hierarchical order in the ORDER BY clause -- that is, so that all sibling order columns appear to the right of any of their ancestor order columns -- we sort the result set in both sibling and hierarchical order; that is, in document order.

As an example, consider the following XML document, which is constructed from the first rows in tables A, B, and C. Upper case tag names indicate row elements and lower case tag names indicate column elements:

   <A>
      <a>A1</a>
      <B>
         <b>B1</b>
      </B>
      <C>
         <c>C1</c>
      </C>
   </A>

For simplicity, we can represent the above document with the following notation. This is used throughout the rest of this appendix, leaving it as an exercise to the reader to construct the actual documents. (I've always wanted to say that.)

     A1
    /  \
   B1  C1

The universal table for this document is as follows, where Row identifies the table and row the result set row comes from. In this case, the first row comes from table A (the first SELECT statement), the second row comes from tables A and B (the second SELECT statement), and the third row comes from tables A and C (the third SELECT statement).

   Row      a   aOrder   bcOrder   b   c
   ---      --  ------   -------   --  --
   A1       A1    1        --      --  --
   B1       A1    1        2       B1  --
   C1       A1    1        3       --  C1

Notice how each row is uniquely identified by a chain of order values and how the rows (especially the rows from sibling tables B and C) are sorted in the correct order, both with respect to hierarchy and sibling order. Also note that in the first row, the value of the bcOrder column is NULL. In the general case, all order columns at a level below the current level in the hierarchy are NULL. It is therefore important that NULLs sort at the low end of the result set so that higher-level rows appear before lower level rows.

(The values for bcOrder start with 2 because this represents their actual position in the output document. Not included in the example is the column in table A that gives the order of the column element for A.a in the document. This is omitted for clarity. Since there can be more than one such order column per row, it is impossible to sort on these columns and they do not affect the current discussion. As was noted earlier, the processing software must take these values into account, mixing column elements from the parent row with row elements from the child rows.)

Now consider a more complex example, in which there are multiple rows from tables B and C:

         A1
      _/ /\ \_
     /  /  \  \
   B1  C1  C2  B2

The universal table for this document is as follows:

   Row      a   aOrder   bcOrder   b   c
   ---      --  ------   -------   --  --
   A1       A1    1        --      --  --
   B1       A1    1        2       B1  --
   C1       A1    1        3       --  C1
   C2       A1    1        4       --  C2
   B2       A1    1        5       B2  --

Again, notice how each row is uniquely identified by a chain of order values and the rows from tables B and C are sorted correctly with respect to both hierarchical and sibling order.

To see how this works with more levels of nesting, consider the following hierarchy of tables, all of which are constructed in the same manner as the original tables.

                   A
                 /   \
                B     C
               / \   / \
              D   E F   G

The SELECT statements for creating a universal table over this hierarchy are:

   SELECT a, aOrder, NULL as bcOrder, NULL as b, NULL as c,
                     NULL as deOrder, NULL as d, NULL as e,
                     NULL as fgOrder, NULL as f, NULL as g FROM A
    UNION ALL
   SELECT A.a, NULL, bOrder, b, NULL, NULL, NULL,NULL,NULL,NULL,NULL
      FROM A, B WHERE A.a = B.a
    UNION ALL
   SELECT A.a, NULL, cOrder, NULL, c, NULL, NULL,NULL,NULL,NULL,NULL
      FROM A, C WHERE A.a = C.a
    UNION ALL
   SELECT A.a, NULL, bOrder, b, NULL, dOrder, d, NULL,NULL,NULL,NULL
      FROM A, B, D WHERE (A.a = B.a) AND (B.b = D.b)
    UNION ALL
   SELECT A.a, NULL, bOrder, b, NULL, eOrder, NULL, e,NULL,NULL,NULL
      FROM A, B, E WHERE (A.a = B.a) AND (B.b = E.b)
    UNION ALL
   ...
   ORDER BY aOrder, bcOrder, deOrder, fgOrder

Now consider the following document, which has multiple rows from each table except A:

                   A1
          ______/ /  \ \________
         /       /    \         \
       B1      C1      B2        B3
     / | \     /\      /\      / | \
   D1 D2 E1  F1  G1  E2  D3   E3 E4 D4

The universal table for this document is:

    Row    a  aOrder  bcOrder  b   c  deOrder  d   e  fgOrder  f   g
    ---    -- ------  ------- --  --  ------- --  --  ------- --  --
    A1     A1    1      --    --  --    --    --  --   --     --  --
    B1     A1    1      2     B1  --    --    --  --   --     --  --
    D1     A1    1      2     B1  --    2     D1  --   --     --  --
    D2     A1    1      2     B1  --    3     D2  --   --     --  --
    E1     A1    1      2     B1  --    4     --  E1   --     --  --
    C1     A1    1      3     --  C1    --    --  --   --     --  --
    F1     A1    1      3     --  C1    --    --  --   2      F1  --
    G1     A1    1      3     --  C1    --    --  --   3      --  G1
    B2     A1    1      4     B2  --    --    --  --   --     --  --
    E2     A1    1      4     B2  --    2     --  E2   --     --  --
    D3     A1    1      4     B2  --    3     D3  --   --     --  --
    B3     A1    1      5     B3  --    --    --  --   --     --  --
    E3     A1    1      5     B3  --    2     --  E3   --     --  --
    E4     A1    1      5     B3  --    3     --  E4   --     --  --
    D4     A1    1      5     B3  --    4     D4  --   --     --  --

This example shows clearly the chain of order values that identifies each row element. It also shows something not visible in earlier examples, and that is that a chain can have more order columns than an element is deep. This is an artifact of flattening a hierarchy into a linear set of ORDER BY columns. For example, the chain of columns identifying each row is aOrder, bcOrder, deOrder, fgOrder. However, deOrder and fgOrder are parallel and exclusive -- that is, they represent sets of elements that are at the same level of the hierarchy but not in the same set of siblings. Thus, one of deOrder and fgOrder will always be NULL, so a given chain actually contains only as many non-NULL values as an element is deep.

The rules for sorting the universal result set in document order (both hierarchical and sibling order) are as follows:

Each set of sibling tables must share a sibling order column in the universal table. If more than one row is retrieved from the root table, a sibling order column from this table must also be included.
The sibling order columns must appear in the ORDER BY clause in hierarchical order. That is, the order column for a given set of siblings must appear before the order columns for any descendant sets of siblings. For a given set of order columns, there are usually multiple ORDER BY clauses that satisfy this requirement.
There is no requirement that a set of siblings be ordered or that a set of order values be unique. In the first case, the siblings will be created in the order they are returned from the database. In the second case, all siblings with the same order value will be created in the order they are returned from the database. However, because a chain of order values identifies a row element, it must be unique in order to preserve hierarchical order. In the first case (unordered siblings), this is done by including the join column (from parent to sibling) in the ORDER BY clause in place of the order column. In the second case (non-unique order values), this is done by including the join column in the ORDER BY clause in addition to the order column.

Ronald Bourret

Technical writing, Software development, and Consulting