Pxp_documentis hard to read without some background information. This introduction is supposed to provide this information.
Also note that there is also a stream representation of XML. See
Intro_events for more about this.
The structure of document trees
In a document parsed with the default parser settings every node
represents either an element or a character data section. There are
two classes implementing the two aspects of nodes:
Pxp_document.data_impl. There are configurations which
allow more node types to be created, in particular processing
instruction nodes, comment nodes, and super root nodes, but these are
discussed later. Note that you can always add these extra node types
yourself to the node tree no matter what the parser configuration
The following figure
shows an example how
a tree is constructed from element and data nodes. The circular areas
represent element nodes whereas the ovals denote data nodes. Only elements
may have subnodes; data nodes are always leaves of the tree. The subnodes
of an element can be either element or data nodes; in both cases the O'Caml
objects storing the nodes have the class type
Attributes (the clouds in the picture) do not appear as nodes of the tree, and one must use special access methods to get them.
You would get such a tree by parsing with
let config = Pxp_types.default_config let source = Pxp_types.from_string "An orangeCherries
" let spec = Pxp_tree_parser.default_spec let doc = Pxp_tree_parser.parse_document_entity config source spec let root = doc#root
config record sets a lot of parsing options. A number of these
options are explained below. The
source argument says from where
the parsed text comes. For the mysterious
spec parameter see below.
The parser returns
doc, which is a
have to call its
root method to get the root of the tree. Note that
there are other parsing functions that return directly nodes; these
are intended for parsing XML fragments, however. For the usual closed
XML documents use a function that returns a document.
root is, as other nodes of the tree, an object instance of the
Pxp_document.node class type.
What about other things that can occur in XML text? As mentioned,
by default only elements and data nodes appear in the tree, but it is
possible to enable more node types by setting appropriate
enable_comment_nodes, however, comments are added to the tree. There is a special node type for comments.
<? ... ?>parentheses) are not ignored, but normally no nodes are created for them. The instructions are only gathered up, and attached to the surrounding node, so one can check for their presence but not for their exact location. By setting the config option
enable_pinstr_nodes, however, processing instructions are added to the tree as normal nodes. There is also a special node type for them.
enable_super_root_node. The node is called super root node, and is also a special type of node.
<![CDATA[some text]]>) are simply added to the surrounding data node, so they do not appear as nodes of their own.
&) are automatically resolved, and the resolution is added to the surrounding node
a & b <!-- comment --> c <![CDATA[<> d]]>
is represented by only one data node, for instance (for the default case where no comment nodes are created). Of course, you can create document trees manually which break this invariant; it is only the way the parser forms the tree.
All types of nodes are represented by the same Ocaml objects of
Pxp_document.node. The method
a hint which type of node the object is. See the type
Pxp_document.node_type for details how the mentioned node types are
reflected by this method. For instance, for elements this method
T_element n where
n is the name of the element.
Note that this means formally that all access methods are implemented
for all node types. For example, you can get the attributes of
data nodes by calling the
although this does not
make sense. This problem is resolved on a case-by-case basis by
either returning an "empty value" or by raising appropriate
For the chosen typing it is not possible to define slimmer class types
that better fit the various node types.
Attributes are usually represented as pairs
string * att_value of names and values. Here,
Pxp_core_types.S.att_value is a conventional variant type. There are lots of
access methods for attributes, see below. It is optionally possible
to wrap the attributes as nodes (method
Pxp_document.node.attributes_as_nodes), but even in this case the attributes
are outside the regular document tree.
Normally, the processing instructions are also not included
into the document tree. They are considered as an extra property of the
element to which they are attached, and can be retrieved by the
method of the element node. If this way of handling processing instructions
is not exact enough, the parser can optionally create processing instruction
nodes that are regular members of the document tree.
An overview over some relevant access methods:
Pxp_document.node.parent): returns the parent object of the node it is called on
Pxp_document.node.root): returns the root of the tree the node is member of
Pxp_document.node.sub_nodes): returns the children of the node
Pxp_document.node.previous_node): returns the left sibling
Pxp_document.node.next_node): returns the right sibling
Pxp_document.node.node_position): returns the ordinal position of this node as child of the parent
Pxp_document.node.node_path): returns the positional path of this node in the whole tree
Pxp_document.node.node_type): returns the type of the node
Pxp_document.node.position): returns the position of the node in the parsed XML text
Pxp_document.node.data): returns the data contents of data nodes
Pxp_document.node.attributes): returns the attributes of elements
Pxp_document.node.attributes_as_nodes): also returns the attributes, but represented as a list of nodes residing outside the tree
Pxp_document.node.comment): returns the text of the XML comment
Pxp_document.node.validate): validates the element locally
Pxp_document.node.localname): returns the local name of the element in the namespace
Pxp_document.node.namespace_uri): returns the namespace URI of the node
Pxp_document.node.namespace_scope): returns the scope object with more namespace query methods
Pxp_document.node.namespaces_as_nodes): returns the namespaces this node is member of, and the namespaces are represented as list of nodes
Trees are mutable, and nodes are mutable. Note that the tree is not
automatically (re-)validated when it is changed. You have to explicitly
call validation methods, or the
Pxp_document.validate function for the
Pxp_document.node.create_element): called on an element node, this method creates a new tree only consisting of an element, and the only node of the tree is an object of the same class as this node
Pxp_document.node.create_data): same for data nodes
Pxp_document.node.orphaned_clone): creates a copy of the subtree starting at this node
Links between nodes
The node tree has links in both directions: Every node has a link to its parent (if any), and it has links to the subnodes (see the following picture). Obviously, this doubly-linked structure simplifies the navigation in the tree; but has also some consequences for the possible operations on trees.
Because every node must have at most one parent node,
operations are illegal if they violate this condition. The following figure
shows on the left side
y is added to
x as new subnode
which is allowed because
y does not have a parent yet. The
right side of the picture illustrates what would happen if
had a parent node; this is illegal because
y would have two
parents after the operation.
remove operation simply removes the links between two nodes. In the
following picture the node
x is deleted from the list of subnodes of
y. After that,
x becomes the root of the
subtree starting at this node.
It is also possible to make a clone of a subtree; illustrated in the next picture. In this case, the clone is a copy of the original subtree except that it is no longer a subnode. Because cloning never keeps the connection to the parent, the clones are called orphaned.
Optional features of document trees
As already pointed out, the parser does only create element and data nodes by
default. The configuration of the parser can be controlled by the
Pxp_types.config record. There are a number of optional features that
change the way the document trees are constructed by the parser:
Note that the parser configuration only controls the parser. If you create trees of your own, you can simply add all the additional node types to the tree without needing to enable these features.
enable_super_root_nodeis set, the extra super root node is generated at the top of the tree. This node has type
enable_comment_nodeslets the parser add comment nodes when it parses comments. These nodes have type
enable_pinstr_nodeschanges the way processing instructions are added to the document. Instead of attaching such instructions to their containing elements as additional properties, this mode forces the parser to create real nodes of type
drop_ignorable_whitespace(enabled by default) can be turned off. It controls whether the parser skips over so-called ignorable whitespace. The XML standard allows that elements contain whitespace characters even if they are declared in the DTD not to contain character data. Because of this, the parser considers such whitespace as ignorable detail of the XML instance, and drops the characters silently. You can change this by setting
false. In this case, every character of the XML instance will be accepted by the parser and will be added to a data node of the document tree.
position. As this requires a lot of memory, it is possible to turn this off by setting
There are a number of further configuration options; however, these options do not change the structure of the document tree.
Optional features of nodes
The following features exist per node, and are simply invoked by using the methods dealing with them.
attributes_as_nodesreturns the attributes wrapped into node objects. Note that these nodes are read-only.
The DTD object contains the declarations of elements, attribute lists, entities, and notations. Furthermore, the DTD knows whether the document is flagged as "standalone". As a PXP extension to classic XML processing, the DTD may specify a mixed mode between "validating mode" and "well-formedness mode". It is possible to allow non-declared elements in the document, but to check declared elements against their declaration at the same time. Moreover, there is a similar feature for attribute lists; you can allow non-declared attributes and check declared attributes. (Well, the whole truth is that the parser always works in this mix mode, and that the "validating mode" and the "well-formedness mode" are only the extremes of the mix mode.)
Creating nodes and trees
Often, the parser creates the trees, but on occasion it is useful to create trees manually. We explain here only the basic mechanism. There is a nice camlp4 syntax extension called pxp-pp (XXX: LINK) allowing for a much better notation in programs.
The most basic way of creating new nodes are the
create_other methods of nodes. It is not recommended
to use them directly, however, as they are very primitive.
Pxp_document module there are a number of functions creating
individual nodes (without children), the node constructors:
Pxp_document.create_element_node: creates an element node
Pxp_document.create_data_node: creates a data node
Pxp_document.create_comment_node: creates a comment node
Pxp_document.create_pinstr_node: creates a processing instruction node
Pxp_document.create_super_root_node: creates a super root node
The node constructors must be equipped with all required data to create the requested type of node. This includes the data that would have been available in the textual XML representation, and some of the meta data passed to the parsers, and meta data implicitly generated by the parsers. For an element, this is at minimum:
After some nodes have been created, they can be combined to more complex
trees by mutation methods (e.g.
As mentioned, a node must always be connected with a DTD object, even if no validation checks will be done. It is possible to create DTD objects that do not impose restrictions on the document:
let dtd = Pxp_dtd_parser.create_empty_dtd config
dtd # allow_arbitrary
Even such a DTD object can contain entity definitions, and can demand
a certain way of dealing with namespaces. Also, the character encoding
of the nodes is taken from the DTD. See
DTD methods, and
Pxp_dtd_parser for convenient ways to create DTD
objects. Note that all nodes of a tree must be connected to the same
PXP is not restricted to using built-in classes for nodes. When the
parser is invoked and a tree is built, it is looked up in a so-called
document model specification how the new objects have to be
Pxp_document.spec. Basically, it is a list of sample
objects to use (which are called exemplars), and these objects are
cloned when a node is actually created.
When calling the node constructors directly (bypassing the parser), the document model specification has also to be passed to them as argument. It is used in the same way as the parser uses it.
Every node in a tree has a so-called extension. By default, the extension is practically empty and only present for formal uniformity. However, one can also define custom extension classes, and effectively add new methods to the node classes.
Node extensions are explained in detail here:
As an option, PXP processes namespace declarations in XML text.
See this separate introduction for details:
Details of the mapping from XML text to the tree representation
If an element declaration does not allow the element to contain character data, the following rules apply.
If the element must be empty, i.e. it is declared with the
EMPTY, the element instance must be effectively
empty (it must not even contain whitespace characters). The parser guarantees
that a declared
EMPTY element never contains a data
node, even if the data node represents the empty string.
If the element declaration only permits other elements to occur within that element but not character data, it is still possible to insert whitespace characters between the subelements. The parser ignores these characters, too, and does not create data nodes for them.
Example. Consider the following element types:
<!ELEMENT x ( #PCDATA | z )* >
<!ELEMENT y ( z )* >
<!ELEMENT z EMPTY>
x may contain character data, the keyword
#PCDATA indicates this. The other types are character-free.
The XML term
will be internally represented by an element node for
with three subnodes: the first
z element, a data node
containing the space character, and the second
In contrast to this, the term
is represented by an element node for
y with only
two subnodes, the two
z elements. There
is no data node for the space character because spaces are ignored in the
By setting the parser option
false, the behaviour of the parser is changed such that
even ignorable whitespace characters are represented by data nodes.
The representation of character data
The XML specification allows all Unicode characters in XML
texts. This parser can be configured such that UTF-8 is used to represent the
characters internally; however, the default character encoding is
ISO-8859-1. (Currently, no other encodings are possible for the internal string
representation; the type
the possible encodings. Principally, the parser could use any encoding that is
ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and
ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal
encodings (or other multibyte encodings which are not ASCII-compatible) unless
major parts of the parser are rewritten - unlikely...)
The internal encoding may be different from the external encoding (specified
in the XML declaration
<?xml ... encoding="..."?>); in
this case the strings are automatically converted to the internal encoding.
If the internal encoding is ISO-8859-1, it is possible that there are
characters that cannot be represented. In this case, the parser ignores such
characters and prints a warning (to the
object that must be passed when the parser is called).
The XML specification allows lines to be separated by single LF characters, by CR LF character sequences, or by single CR characters. Internally, these separators are always converted to single LF characters.
The parser guarantees that there are never two adjacent data nodes; if necessary, data material that would otherwise be represented by several nodes is collapsed into one node. Note that you can still create node trees with adjacent data nodes; however, the parser does not return such trees.
Note that CDATA sections are not represented specially; such sections are added to the current data material that is being collected for the next data node.
The representation of entities within documents
Entities are not represented within documents! If the parser finds an entity reference in the document content, the reference is immediately expanded, and the parser reads the expansion text instead of the reference.
The representation of attributes
As attribute values are composed of Unicode characters, too, the same problems with the character encoding arise as for character material. Attribute values are converted to the internal encoding, too; and if there are characters that cannot be represented, these are dropped, and a warning is printed.
Attribute values are normalized before they are returned by
attribute. First, any remaining entity
references are expanded; if necessary, expansion is performed recursively.
Second, newline characters (any of LF, CR LF, or CR characters) are converted
to single space characters. Note that especially the latter action is
prescribed by the XML standard (but
is not converted
such that it is still possible to include line feeds into attributes).
The representation of processing instructions
Processing instructions are parsed to some extent: The first word of the PI is called the target, and it is stored separated from the rest of the PI:
The exact location where a PI occurs is not represented (by default). The parser attaches the PI to the object that represents the embracing construct (an element, a DTD, or the whole document); that means you can find out which PIs occur in a certain element, in the DTD, or in the whole document, but you cannot lookup the exact position within the construct.
If you require the exact location of PIs, it is possible to
create regular nodes for them instead of attaching them to the surrounding
node as property. This mode is controlled by the option
enable_pinstr_nodes. The additional nodes have the node type
T_pinstr target, and are created
from special exemplars contained in the
The representation of comments
Normally, comments are not represented; they are dropped by default.
However, if you require comment in the document tree, it is possible to create
T_comment nodes for them. This mode can be specified by the
enable_comment_nodes. Comment nodes are created from
special exemplars contained in the
Pxp_document.spec). You can access the contents of comments through the
These attributes are not supported specially; they are handled like any other attribute.
Note that the utility function