6. Processing XML Documents

DOM

Recall that an XML parser builds a tree-like structure from an input XML document. Like XML itself, the structure of this tree is defined by a W3C standard, the Document Object Model (DOM).

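A DOM tree has the form described earlier, except that each attribute of an element node is itself a node (of type Attr) attached to that element. In the DOM API, attribute nodes are reached through getAttributes() rather than getChildNodes().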
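As a rough illustration of how attributes appear in a DOM tree, the sketch below parses a small in-memory document and compares the root element's attribute map with its child list. The class name AttrDemo, the helper parse, and the sample document are assumptions made for this example only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class AttrDemo {
   // Hypothetical helper: parses an XML string into a DOM tree.
   // Checked parser exceptions are wrapped to keep the example short.
   static Document parse(String xml) {
      try {
         DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
         return builder.parse(new InputSource(new StringReader(xml)));
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      Document doc =
         parse("<book id=\"b1\" lang=\"en\"><title>XML</title></book>");
      Element root = doc.getDocumentElement();
      // Attribute nodes live in a separate map ...
      NamedNodeMap attrs = root.getAttributes();
      System.out.println("attribute nodes: " + attrs.getLength());
      // ... while the child list holds only the <title> element.
      System.out.println("child nodes: " + root.getChildNodes().getLength());
      System.out.println("id = " + root.getAttribute("id"));
   }
}
```

With this sample input the root element reports two attribute nodes and one child node, so a tree walker that wants to see attributes has to ask for them explicitly.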

Each type of node is specified by an API. The relationships between these APIs are given by the following UML class diagram:

There are DOM parsers (i.e., XML-to-DOM tree parsers) implemented in many languages: C, Java, C++, JavaScript, VBScript, etc.

Here are a few lines of Java code that show how the top-level document node of a DOM tree is obtained from an XML document:

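import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

String xmlFile = "mydoc.xml";
// obtain parser factory:
DocumentBuilderFactory factory =
   DocumentBuilderFactory.newInstance();
// obtain parser (may throw ParserConfigurationException):
DocumentBuilder builder =
   factory.newDocumentBuilder();
// parse an XML document into a DOM tree
// (may throw SAXException or IOException):
Document doc =
   builder.parse(new File(xmlFile));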

A programmer might create a visitor or tree walker object that traverses a DOM tree, processing each node as required:

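import org.w3c.dom.*;

// VisitorException is assumed to be an application-defined checked exception.
public class Visitor {
   protected String name, value;
   protected int depthCounter = 0;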
   public void visit(Document doc) {
      try {
         visit((Node)doc.getDocumentElement());
      } catch(VisitorException e) {
         handle(e);
      }
   }
   public void visit(Node node) throws VisitorException {
      // see below
   }
   public void visit(NodeList nodes) throws VisitorException {
      for(int i = 0; i < nodes.getLength(); i++) {
         depthCounter++;
         visit(nodes.item(i));
         depthCounter--;
      }
   }
   // overridables:
   protected void visit(Element node) throws VisitorException { }
   protected void visit(Text node) throws VisitorException { }
   protected void visit(Attr node) throws VisitorException { }
   protected void handle(VisitorException e) {
      System.err.println("visitor exception: " + e);
   }
   // etc.
}

Visiting a node depends on the name (tag), value (content), and type (class) of the node:

   public void visit(Node node) throws VisitorException {
      name = node.getNodeName();
      value = node.getNodeValue();
      if (value != null) value = value.trim();
      switch (node.getNodeType()) {
         case Node.ELEMENT_NODE:
            Element elem = (Element) node;
            depthCounter++;
            visit(elem);
            NamedNodeMap attributeNodes =
               node.getAttributes();
            for(int i = 0; i < attributeNodes.getLength(); i++) {
               Attr attribute = (Attr) attributeNodes.item(i);
               depthCounter++;
               visit(attribute);
               depthCounter--;
            }
            depthCounter--;
            visit(node.getChildNodes());
            break;
         case Node.CDATA_SECTION_NODE:
         case Node.TEXT_NODE:
            Text text = (Text) node;
            depthCounter++;
            visit(text);
            depthCounter--;
            break;
      } // switch
   } // visit node

We can customize our visitor by overriding the overridable methods in subclasses of the Visitor class.
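One way such a customization might look is sketched below. To keep the example self-contained it uses MiniVisitor, a condensed and hypothetical stand-in for the Visitor class (with hooks named visitElement and visitText instead of overloaded visit methods, and no attribute handling). OutlineVisitor, VisitorDemo, and the sample document are likewise assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

class VisitorDemo {
   // Parses an XML string and returns the outline built by OutlineVisitor.
   static String outline(String xml) {
      try {
         Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
         OutlineVisitor v = new OutlineVisitor();
         v.visit(doc.getDocumentElement());
         return v.out.toString();
      } catch (Exception ex) {
         throw new IllegalStateException(ex);
      }
   }

   public static void main(String[] args) {
      System.out.print(outline("<a><b>hi</b><c/></a>"));
   }
}

// Condensed, hypothetical stand-in for the Visitor base class.
class MiniVisitor {
   protected int depth = 0;

   void visit(Node node) {
      switch (node.getNodeType()) {
         case Node.ELEMENT_NODE:
            visitElement((Element) node);
            NodeList kids = node.getChildNodes();
            depth++;
            for (int i = 0; i < kids.getLength(); i++) visit(kids.item(i));
            depth--;
            break;
         case Node.CDATA_SECTION_NODE:
         case Node.TEXT_NODE:
            visitText((Text) node);
            break;
         default:
            break; // other node types are ignored in this sketch
      }
   }

   // overridables:
   protected void visitElement(Element e) { }
   protected void visitText(Text t) { }
}

// Customization: record an indented outline of element names and text.
class OutlineVisitor extends MiniVisitor {
   final StringBuilder out = new StringBuilder();

   private void line(String s) {
      for (int i = 0; i < depth; i++) out.append("  ");
      out.append(s).append('\n');
   }

   @Override
   protected void visitElement(Element e) {
      line("<" + e.getTagName() + ">");
   }

   @Override
   protected void visitText(Text t) {
      String s = t.getNodeValue().trim();
      if (!s.isEmpty()) line("\"" + s + "\"");
   }
}
```

The traversal logic stays in the base class; the subclass only decides what to do at each node, which is the point of keeping the overridables empty by default.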

SAX (Simple API for XML)

Building and traversing a large DOM tree can be overkill in many applications. For this reason a second standard API, SAX, exists for event-based parsers. A SAX parser reads through an XML document without building a tree. As it encounters different kinds of content (start tags, end tags, text), it broadcasts event notifications. It is the programmer's job to create and register handlers for the types of events they are interested in.
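Before looking at a fuller handler, the event model can be sketched with a minimal one that reacts to a single event type. The class name ElementCounter, the helper count, and the sample input are assumptions for this example. The parser calls are the standard JAXP ones.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounter extends DefaultHandler {
   int elements = 0;

   // Called by the parser each time a start tag is encountered.
   @Override
   public void startElement(String uri, String localName,
                            String qName, Attributes attrs) {
      elements++;
   }

   // Hypothetical helper: parses an XML string and reports how many
   // startElement events were delivered to the handler.
   static int count(String xml) {
      try {
         ElementCounter handler = new ElementCounter();
         SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
         return handler.elements;
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      // a, b, b, c: four start tags
      System.out.println(count("<a><b/><b/><c>text</c></a>"));
   }
}
```

No tree is ever built here: the only state kept is the counter, which is why this style suits large documents where only a summary is needed.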

The following class extends the do-nothing handlers of the DefaultHandler class by redefining the handlers for entering and exiting element node events as well as the handler that's called when a text node is encountered. These handlers print the document showing levels of nesting by indentation:

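import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXDemo extends DefaultHandler {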
   String prefix = "..........................";
   int depth = 0;
   StringBuffer textBuffer;
   public static void main(String argv[]) {
      if (argv.length != 1) {
         System.err.println("Usage: cmd filename");
         System.exit(1);
      }
      // Use an instance of ourselves as the SAX event handler
      DefaultHandler handler = new SAXDemo();
      // Use the default (non-validating) parser
      SAXParserFactory factory = SAXParserFactory.newInstance();
      try {
         // Parse the input
         SAXParser saxParser = factory.newSAXParser();
         saxParser.parse( new File(argv[0]), handler );
      }
      catch (Throwable t) {
         t.printStackTrace();
      }
   }

   public void startElement(
     String namespaceURI,
     String sName, // simple name
     String qName, // qualified name
     Attributes attrs)
  throws SAXException {
     depth++;
     System.out.print(prefix.substring(0, Math.min(depth, prefix.length())));
     System.out.println(qName + " node detected");
     // System.out.println(sName + " = sname");
     int n = attrs.getLength();
     for(int i = 0; i < n; i++) {
        depth++;
        System.out.print(prefix.substring(0, Math.min(depth, prefix.length())));
        System.out.println(
         "attribute node: (" +
         attrs.getQName(i) + ", " + attrs.getValue(i) + ")");
        depth--;
     }
  }

  public void endElement(
     String namespaceURI,
     String sName, // simple name
     String qName  // qualified name
     )
  throws SAXException {
     if (textBuffer != null) {
        String s = ""+textBuffer;
        s = s.trim();
        if (!s.equals("")) {
           depth++;
           System.out.print(prefix.substring(0, Math.min(depth, prefix.length())));
           System.out.println("text node: " + s);
           depth--;
        }
        textBuffer = null;
     }
     depth--;
     //System.out.println("===================");
  }

  public void characters(char buf[], int offset, int len)
  throws SAXException {
     // The parser may deliver one run of text in several calls, so accumulate it.
     String s = new String(buf, offset, len);
     if (textBuffer == null) {
        textBuffer = new StringBuffer(s);
     } else {
        textBuffer.append(s);
     }
  }
}