SCDJWS Study Guide: JAXP



Document Object Model (DOM)

The DOM is a set of interfaces that exposes an XML document as a tree of nodes, with methods for accessing and manipulating those nodes and their relationships. Some nodes in the tree may contain other nodes (depending on node type). Each node has a type that is represented by an interface in the package org.w3c.dom, such as Element, Attr, Comment, and Text.

This tree can then be manipulated like any other tree data structure. The DOM allows you to programmatically navigate the tree, to access nodes in any order (random access), and to modify the XML document. Each document node contains exactly one root element node, zero or more comment and processing instruction nodes, and zero or one doctype node.


SAX consists of an event-based set of callbacks, while DOM builds an in-memory tree structure. SAX creates no data structure for you to work on. As a result, SAX allows neither random access to particular pieces of data in the XML document nor modification of the document; DOM allows both.

The org.w3c.dom.Document interface represents an XML document and is made up of DOM nodes that represent the elements, attributes, and other XML constructs. With DOM, JAXP is responsible only for returning a DOM Document object from parsing. The downside of using DOM is that it is memory and CPU intensive, since building the tree requires that the entire XML structure be read and held in memory.
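
The contrast can be sketched as follows. This is a minimal illustration, not code from the guide: the class name SaxVsDomSketch, the sample XML, and both helper methods are hypothetical. With SAX the application keeps its own state while events arrive; with DOM it queries the finished tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxVsDomSketch {
    // Hypothetical sample document
    static final String XML = "<items><item>a</item><item>b</item></items>";

    // SAX: the parser pushes events, so we keep our own state (a counter).
    static int countWithSax(String xml, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    // DOM: the parser returns a tree, so we can jump straight to any node.
    static String secondItemWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("item").item(1).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countWithSax(XML, "item"));
        System.out.println(secondItemWithDom(XML));
    }
}
```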

DOM Parser Processing

  • A DocumentBuilder instance (the DOM parser) is created, using a specific vendor's parser implementation.
  • The parser parses the document and returns an org.w3c.dom.Document object.
  • The entire document is stored in memory.
  • DOM methods and interfaces are used to extract data from this object.
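
The steps above can be sketched on an in-memory string, which keeps the example runnable without a file. The class name DomParseSketch and the sample order document are assumptions made for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomParseSketch {
    // Parses XML text into an in-memory DOM tree and returns the Document.
    static Document parse(String xml) throws Exception {
        // Step 1: obtain a factory and create a DocumentBuilder (the DOM parser)
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Steps 2 and 3: parse; the whole document is now held in memory
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document
        Document doc = parse("<order id=\"42\"><item>book</item></order>");
        // Step 4: use DOM interfaces to extract data from the tree
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDocumentElement().getAttribute("id"));
    }
}
```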

The DocumentBuilderFactory Factory Class

DocumentBuilderFactory is an abstract class (following the Abstract Factory pattern) that defines a factory API enabling applications to obtain a parser that produces DOM object trees from XML documents. A new factory instance is obtained by calling the static method DocumentBuilderFactory.newInstance(). This method uses the following ordered lookup procedure to determine the DocumentBuilderFactory implementation class to load:

  • Use the javax.xml.parsers.DocumentBuilderFactory system property.
  • Use the properties file "lib/jaxp.properties" in the JRE directory. This configuration file is in standard java.util.Properties format and contains the fully qualified name of the implementation class with the key being the system property defined above.
  • Use the Services API (as detailed in the JAR specification), if available, to determine the classname. The Services API will look for a classname in the file META-INF/services/javax.xml.parsers.DocumentBuilderFactory in jars available to the runtime.
  • Platform default DocumentBuilderFactory instance.
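
To see which implementation the lookup selected on a given runtime, something like the following can help. It is a small diagnostic sketch; the class name FactoryLookupSketch is hypothetical, and the class name printed depends on the JRE and classpath in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class FactoryLookupSketch {
    // Name of the system property that newInstance() consults first
    static final String PROPERTY = "javax.xml.parsers.DocumentBuilderFactory";

    // Returns the class name of whichever implementation the lookup selected.
    static String resolvedFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Prints null unless the property was set, e.g. with -D on the command line
        System.out.println("System property:  " + System.getProperty(PROPERTY));
        System.out.println("Resolved factory: " + resolvedFactoryClass());
    }
}
```

If the system property is unset and no properties file or service entry overrides it, the name printed is that of the platform default.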

The factory can then be configured to handle validation and namespaces (several other settings can be selected; see the DocumentBuilderFactory class for details). Once you have the newly created factory, a DocumentBuilder instance, which is used to parse the XML file, can be obtained from the DocumentBuilderFactory.newDocumentBuilder() method. Once you have a DocumentBuilder instance, XML can be parsed from a variety of input sources. The DOM Document object is returned by the builder's parse method.

import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
// JAXP
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
// DOM
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class TestDOMParsing {
    public static void main(String[] args) {
        try {
            if (args.length != 1) {
                System.err.println ("Usage: java TestDOMParsing " +
                                    "[filename]");
                System.exit (1);
            }
            // Get Document Builder Factory
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            // Turn on validation, and turn off namespaces
            factory.setValidating(true);
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new File(args[0]));
            // Print the document from the DOM tree and
            //   feed it an initial indentation of nothing
            printNode(doc, "");
        } catch (ParserConfigurationException e) {
            System.out.println("The underlying parser does not " +
                               "support the requested features.");
        } catch (FactoryConfigurationError e) {
            System.out.println("Error occurred obtaining Document " +
                               "Builder Factory.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
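    private static void printNode(Node node, String indent)  {
        // Print this node's name, then recurse into its children
        // with a deeper indentation
        System.out.println(indent + node.getNodeName());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            printNode(children.item(i), indent + "  ");
        }
    }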
}

Two problems can arise with this code (as with SAX in JAXP): a FactoryConfigurationError and a ParserConfigurationException. The cause of each is the same as it is with SAX. Either a problem is present in the implementation classes (resulting in a FactoryConfigurationError), or the parser provided doesn't support the requested features (resulting in a ParserConfigurationException).

DocumentBuilder

Once you have a DOM factory, you can obtain a DocumentBuilder instance. The methods available to a DocumentBuilder instance are very similar to those available to its SAX counterpart. The major difference is that the variations of the parse() method do not take an instance of the SAX DefaultHandler class. Instead they return a DOM Document instance representing the XML document that was parsed.

The only other difference is that two methods are provided for SAX-like functionality:

  • setErrorHandler(), which takes a SAX ErrorHandler implementation to handle problems that might arise in parsing
  • setEntityResolver(), which takes a SAX EntityResolver implementation to handle entity resolution.

The following listing shows examples of these methods in action:

// Get a DocumentBuilder instance
DocumentBuilder builder = builderFactory.newDocumentBuilder();
// Find out whether this parser is configured to validate
boolean isValidating = builder.isValidating();
// Find out whether this parser is configured to be namespace-aware
boolean isNamespaceAware = builder.isNamespaceAware();
// Set a SAX ErrorHandler
builder.setErrorHandler(myErrorHandlerImpl);
// Set a SAX EntityResolver
builder.setEntityResolver(myEntityResolverImpl);
// Parse, in a variety of ways (each call returns a new Document)
Document doc;
// Use a file
doc = builder.parse(new File(args[0]));
// Use a SAX InputSource
doc = builder.parse(mySaxInputSource);
// Use an InputStream (no DefaultHandler argument, unlike SAXParser.parse)
doc = builder.parse(myInputStream);
// Use a URI
doc = builder.parse("http://www.newInstance.com/xml/doc.xml");
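
As an illustration of what an ErrorHandler implementation might look like, here is a minimal sketch. The names ErrorHandlerSketch, CollectingHandler, and validate are hypothetical, and collecting problems in a list is just one possible policy. It assumes DTD validation is switched on at the factory, so validity problems reach error() while well-formedness problems reach fatalError().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ErrorHandlerSketch {
    // Hypothetical handler: collects warnings and errors, rethrows fatal errors.
    static class CollectingHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            problems.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // not well-formed: abort the parse
        }
    }

    // Parses with DTD validation on and returns the problems the handler saw.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        CollectingHandler handler = new CollectingHandler();
        builder.setErrorHandler(handler);
        builder.parse(new InputSource(new StringReader(xml)));
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
        System.out.println(validate(dtd + "<a/>"));        // valid against the DTD
        System.out.println(validate(dtd + "<a><b/></a>")); // violates the DTD
    }
}
```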

You invoke the parser by calling the parse method of the document builder, supplying a File, an input stream, a URI (represented as a string), or an org.xml.sax.InputSource. The returned Document represents the parsed result as a tree structure.

Obtain the root element of the tree by calling document.getDocumentElement(). This returns an Element, a subinterface of the more general Node interface, that represents the document's root XML element.
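
The distinction between the Document node and the root element is easy to miss, so here is a short sketch. The class name RootElementSketch and the sample XML are hypothetical. The Document node's first child need not be the root element: here it is a comment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RootElementSketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: a comment precedes the root element
        Document doc = parse("<!-- header --><root><child/></root>");
        Node first = doc.getFirstChild();        // the comment node
        Element root = doc.getDocumentElement(); // the root element
        System.out.println(first.getNodeType() == Node.COMMENT_NODE);
        System.out.println(root.getTagName());
    }
}
```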

DOM Node Structure

DOM provides an interface to navigate and manipulate the hierarchical structure of markup. Most DOM objects are nodes, providing generalized means of navigating from one node to another, changing the children of the node, and other ways of using and modifying the structure of the content.

The DOM presents a document as a hierarchy of node objects.

The following table lists the different W3C node types, and which node types they may have as children:

| Node type | Description | Children |
|---|---|---|
| Document | Represents the entire document (it is the root node of the DOM tree) | Element (max. one), ProcessingInstruction, Comment, DocumentType |
| DocumentFragment | Represents a "lightweight" Document object, which can hold a portion of a document | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference |
| DocumentType | Represents a list of the entities that are defined for the document | None |
| EntityReference | Represents an entity reference | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference |
| Element | Represents an element | Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference |
| Attr | Represents an attribute | Text, EntityReference |
| ProcessingInstruction | Represents a "processing instruction" | None |
| Comment | Represents a comment | None |
| Text | Represents textual content (character data) in an element or attribute | None |
| CDATASection | Represents a block of text that may contain characters that would otherwise be treated as markup | None |
| Entity | Represents an entity | Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference |
| Notation | Represents a notation declared in the DTD | None |

The Node interface is the primary datatype for the entire Document Object Model. It represents a single node in the document tree. While all objects implementing the Node interface expose methods for dealing with children, not all objects implementing the Node interface may have children. For example, Text nodes may not have children, and adding children to such nodes results in a DOMException being raised.
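
The Text node case can be checked directly. The sketch below is illustrative only (the class name TextChildSketch and its helper are hypothetical); it builds an empty Document, creates a Text node, and reports the code of the DOMException raised when a child is appended.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class TextChildSketch {
    // Tries to append a child to a Text node and returns the DOMException
    // code, or -1 if no exception was raised.
    static short appendToText() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Text text = doc.createTextNode("leaf");
        try {
            text.appendChild(doc.createElement("child"));
            return -1;
        } catch (DOMException e) {
            return e.code;
        }
    }

    public static void main(String[] args) throws Exception {
        short code = appendToText();
        System.out.println(code == DOMException.HIERARCHY_REQUEST_ERR);
    }
}
```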

The attributes nodeName, nodeValue and attributes are included as a mechanism to get at node information without casting down to the specific derived interface. In cases where there is no obvious mapping of these attributes for a specific nodeType (e.g., nodeValue for an Element or attributes for a Comment ), this returns null. Note that the specialized interfaces may contain additional and more convenient mechanisms to get and set the relevant information.

The values of nodeName, nodeValue, and attributes vary according to the node type as follows:

| nodeType | nodeName | nodeValue | attributes |
|---|---|---|---|
| ATTRIBUTE_NODE | name of attribute | value of attribute | null |
| CDATA_SECTION_NODE | "#cdata-section" | content of the CDATA Section | null |
| COMMENT_NODE | "#comment" | content of the comment | null |
| DOCUMENT_NODE | "#document" | null | null |
| DOCUMENT_FRAGMENT_NODE | "#document-fragment" | null | null |
| DOCUMENT_TYPE_NODE | document type name | null | null |
| ELEMENT_NODE | tag name | null | NamedNodeMap |
| ENTITY_NODE | entity name | null | null |
| ENTITY_REFERENCE_NODE | name of entity referenced | null | null |
| NOTATION_NODE | notation name | null | null |
| PROCESSING_INSTRUCTION_NODE | target | entire content excluding the target | null |
| TEXT_NODE | "#text" | content of the text node | null |

You can examine various properties of a node: the node name (getNodeName), the node type (getNodeType; compare the return value to the constants predefined in the Node interface), the node value (getNodeValue; e.g., for a text node the value is the character data it holds), the attributes used by an element's start tag (getAttributes), and the child nodes (getChildNodes; i.e., the nodes contained between the current element's start and end tags). You can recursively examine each of the child nodes.
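
A quick way to see these accessors side by side is to print them for a few node types. The following is a sketch under stated assumptions: the class name NodePropertiesSketch, the describe helper, and the sample note document are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodePropertiesSketch {
    // Summarizes the generic Node accessors for one node.
    static String describe(Node node) {
        NamedNodeMap attrs = node.getAttributes();
        return "type=" + node.getNodeType()
            + " name=" + node.getNodeName()
            + " value=" + node.getNodeValue()
            + " attributes="
            + (attrs == null ? "null" : String.valueOf(attrs.getLength()));
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: an element with an attribute, text, and a comment
        Element note = parse("<note lang=\"en\">hi<!--c--></note>")
            .getDocumentElement();
        System.out.println(describe(note));
        NodeList children = note.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            System.out.println(describe(children.item(i)));
        }
        System.out.println(describe(note.getAttributeNode("lang")));
    }
}
```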

Navigating the DOM Tree

Once we have the Document object we can traverse the structure as we would any tree. Calling getDocumentElement returns the root element. From there, we can get a NodeList of child nodes and continue downward. At the leaf level of the DOM structure we find Text objects, which inherit from Node. Note that the user needs prior knowledge of the structure of the data in order to know how to access it, unlike in SAX, which just reacts to what it finds.
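
A recursive depth-first walk is one common way to do this traversal. The sketch below is illustrative: the class name TraversalSketch, its helpers, and the sample library document are hypothetical. It records element names in document order and skips the Text nodes, including the whitespace-only ones that appear between elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TraversalSketch {
    // Depth-first walk that records element names in document order.
    static void collectElements(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectElements(children.item(i), names);
        }
    }

    // Parses the text and returns the names of all elements it contains.
    static List<String> elementNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<>();
        collectElements(doc.getDocumentElement(), names);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample; the line breaks become whitespace Text nodes
        String xml = "<library>\n  <book><title>DOM</title></book>\n</library>";
        System.out.println(elementNames(xml));
    }
}
```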

Modification of DOM Tree

One of the advantages of a DOM parser is that we can modify the data structure by adding child nodes (appendChild), removing child nodes (removeChild), and changing a node's value (setNodeValue). Unfortunately, however, the core DOM interfaces don't provide a standard method of writing out a DOM structure in textual format. So you have to either do it yourself (printing out a "<", the node name, the attribute names and values with equal signs between them and quotes around the values, a ">", etc.) or use one of the many existing packages that generate text from a DOM element.
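
The three modification calls can be sketched as follows. For writing the result out, this example uses the identity transform from javax.xml.transform, which ships with JAXP; that is one option among several and is an assumption of this sketch rather than something the text above prescribes. The class name ModifySketch and the sample element names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ModifySketch {
    // Builds a small tree, then applies the three kinds of modification.
    static Document buildAndModify() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);

        Element first = doc.createElement("item");
        first.appendChild(doc.createTextNode("one"));
        root.appendChild(first);                  // add a child node

        Element second = doc.createElement("item");
        root.appendChild(second);
        root.removeChild(second);                 // remove a child node

        first.getFirstChild().setNodeValue("uno"); // change a node's value
        return doc;
    }

    // Serializes the tree with an identity transform (no stylesheet).
    static String toXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(buildAndModify()));
    }
}
```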

