Saturday, August 21, 2010

Faster XPaths with VTD-XML

I've recently started using VTD-XML to evaluate XPath expressions on large XML documents. DOM uses too much memory and is too slow for this. A SAX parser is lean, but it can't evaluate XPaths, access nodes randomly or traverse the document easily. VTD-XML lets you run XPaths and gives you random access to nodes, as DOM does, but much more efficiently.
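To illustrate the SAX limitation, here is a rough sketch that uses only the JDK's SAX API to collect the text of every /catalog/book/title element. The class and method names (SaxTitleCollector, collectTitles) are made up for this example. Note that the handler has to track the current element path by hand, which is exactly the bookkeeping an XPath engine would otherwise do for you.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from VTD-XML): emulates /catalog/book/title with SAX.
class SaxTitleCollector extends DefaultHandler {
  private final Deque<String> path = new ArrayDeque<>(); // open elements, innermost first
  private final List<String> titles = new ArrayList<>();
  private StringBuilder text; // non-null only while inside a matching <title>

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    path.push(qName);
    if (onTitlePath()) {
      text = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (text != null) {
      text.append(ch, start, length); // text may arrive in several chunks
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (text != null && onTitlePath()) {
      titles.add(text.toString().trim());
      text = null;
    }
    path.pop();
  }

  // True when the open-element stack is exactly catalog > book > title.
  private boolean onTitlePath() {
    return String.join("/", path).equals("title/book/catalog");
  }

  static List<String> collectTitles(String xml) {
    try {
      SaxTitleCollector handler = new SaxTitleCollector();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler.titles;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<catalog><book id=\"bk101\"><title>XML Developer's Guide</title></book>"
        + "<book id=\"bk102\"><title>Midnight Rain</title></book></catalog>";
    System.out.println(collectTitles(xml));
  }
}
```

Every new query needs another hand-written state machine like this one, and there is no way to go back to a node once the parser has moved past it.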

In my case, VTD-XML was 60 times faster than DOM when processing a 20MB XML document.

This post shows you how to use VTD-XML for fast XPath evaluation.

Sample XML:
I will use the following XML document in the examples below.

<?xml version="1.0"?>
<catalog>
 <book id="bk101">
  <author>Gambardella, Matthew</author>
  <author>Doe, John</author>
  <title>XML Developer's Guide</title>
 </book>
 <book id="bk102">
  <author>Ralls, Kim</author>
  <title>Midnight Rain</title>
 </book>
 <book id="bk103">
  <author>Corets, Eva</author>
  <title>Maeve Ascendant</title>
 </book>
</catalog>

Loading the XML document:
The following code parses the XML file and creates the navigator and autopilot objects.
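// VTDGen, VTDNav and AutoPilot are in the com.ximpleware package
final VTDGen vg = new VTDGen();
// the second argument turns namespace awareness on or off
if (!vg.parseFile("books.xml", false)) {
  throw new IllegalStateException("Could not parse books.xml");
}
final VTDNav vn = vg.getNav();
final AutoPilot ap = new AutoPilot(vn);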
Selecting all titles:
To print all the titles, use the XPath expression /catalog/book/title. First call selectXPath to compile the expression, then call evalXPath repeatedly to move the cursor to each selected node in turn. evalXPath returns -1 when there are no more matches.
ap.selectXPath("/catalog/book/title");
while (ap.evalXPath() != -1) {
  int val = vn.getText();
  if (val != -1) {
    String title = vn.toNormalizedString(val);
    System.out.println(title);
  }
}
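For comparison, this is roughly what the same title query looks like with the DOM and XPath classes that ship with the JDK. It is only a sketch of the DOM approach in general, not the code I benchmarked, and the class name DomTitles and the method selectTitles are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical comparison class: the title query via JDK DOM + XPath.
class DomTitles {
  static List<String> selectTitles(String xml) {
    try {
      // Builds the complete DOM tree in memory before any query can run.
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate("/catalog/book/title", doc, XPathConstants.NODESET);
      List<String> titles = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        titles.add(nodes.item(i).getTextContent().trim());
      }
      return titles;
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate XPath", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<catalog><book id=\"bk101\"><title>XML Developer's Guide</title></book>"
        + "<book id=\"bk102\"><title>Midnight Rain</title></book></catalog>";
    for (String title : selectTitles(xml)) {
      System.out.println(title);
    }
  }
}
```

The code is about as short as the VTD-XML version. The difference is that DOM creates an object for every node in the document, which is where the memory and time go on large files.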
Selecting all book ids and authors:
This one is a bit more involved as a book can have many authors. In the code below, I first run an XPath to select the books and then iterate over the children, selecting the author nodes.
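ap.selectXPath("/catalog/book");
while (ap.evalXPath() != -1) {
  int val = vn.getAttrVal("id");
  if(val != -1){
    String id = vn.toNormalizedString(val);
    System.out.println("Book id: " + id);
  }
  // move to the first author child, if there is one
  if(vn.toElement(VTDNav.FIRST_CHILD, "author")){
    do {
      val = vn.getText();
      if(val != -1){
        String author = vn.toNormalizedString(val);
        System.out.println("\tAuthor:" + author);
      }
    } while(vn.toElement(VTDNav.NEXT_SIBLING, "author"));
    // return to the book element before the next evalXPath call
    vn.toElement(VTDNav.PARENT);
  }
}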
The output is:
Book id: bk101
 Author:Gambardella, Matthew
 Author:Doe, John
Book id: bk102
 Author:Ralls, Kim
Book id: bk103
 Author:Corets, Eva
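
The same "select the books, then visit each book's authors" pattern can also be written against the JDK's DOM and XPath classes, by evaluating a relative XPath with each book node as the context. This is a sketch for comparison only. The class name DomBookAuthors and the method authorsByBookId are made up for this example, and it returns a map instead of printing so the result is easy to inspect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical comparison class: book ids and their authors via JDK DOM + XPath.
class DomBookAuthors {
  static Map<String, List<String>> authorsByBookId(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList books = (NodeList) xpath.evaluate("/catalog/book", doc, XPathConstants.NODESET);

      // LinkedHashMap keeps the books in document order.
      Map<String, List<String>> result = new LinkedHashMap<>();
      for (int i = 0; i < books.getLength(); i++) {
        Element book = (Element) books.item(i);
        // Relative XPath: only the author children of this particular book.
        NodeList authors = (NodeList) xpath.evaluate("author", book, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int j = 0; j < authors.getLength(); j++) {
          names.add(authors.item(j).getTextContent().trim());
        }
        result.put(book.getAttribute("id"), names);
      }
      return result;
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate XPath", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<catalog><book id=\"bk101\"><author>Gambardella, Matthew</author>"
        + "<author>Doe, John</author><title>XML Developer's Guide</title></book></catalog>";
    authorsByBookId(xml).forEach((id, authors) -> {
      System.out.println("Book id: " + id);
      authors.forEach(author -> System.out.println("\tAuthor:" + author));
    });
  }
}
```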

  1. Anonymous, 4:14 PM

    Could you please let me know whether this is free software or a licensed product?

  2. I think that if vn.toElement(VTDNav.FIRST_CHILD, "author") fails, the cursor position does not change, according to the documentation of toElement(). So vn.toElement(VTDNav.PARENT) should go inside the "if".