2.2. How to parse a document from an application

Let me first give a rough overview of the object model of the parser. The following items are represented by objects:

  • Documents: The document representation is more or less the anchor for the application; all accesses to the parsed entities start here. It is described by the class document contained in the module Pxp_document. You can get some global information, such as the XML declaration the document begins with, the DTD of the document, global processing instructions, and, most importantly, the document tree.

  • The contents of documents: The contents have the structure of a tree: Elements contain other elements and text[1]. The common type to represent both kinds of content is node, which is a class type that unifies the properties of elements and character data. Every node has a list of children (which is empty if the element is empty or the node represents text); nodes may have attributes; nodes always have text contents. There are two implementations of node, the class element_impl for elements, and the class data_impl for text data. You find these classes and class types in the module Pxp_document, too.

    Note that attribute lists are represented by non-class values.

  • The node extension: For advanced usage, every node of the document may have an associated extension which is simply a second object. This object must have the three methods clone, node, and set_node as bare minimum, but you are free to add methods as you want. This is the preferred way to add functionality to the document tree[2]. The class type extension is defined in Pxp_document, too.

  • The DTD: Sometimes it is necessary to access the DTD of a document; the average application does not need this feature. The class dtd describes DTDs, and makes it possible to get representations of element, entity, and notation declarations as well as processing instructions contained in the DTD. This class, and dtd_element, dtd_notation, and proc_instruction can be found in the module Pxp_dtd. There are a couple of classes representing different kinds of entities; these can be found in the module Pxp_entity.
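
The extension requirement described above (an object with at least clone, node, and set_node) can be sketched without PXP itself. In the following self-contained example, the class name counting_extension, the type parameter 'node (a placeholder for the real node type), and the methods visit and visits are all invented for illustration; only the three required method names come from the text.

```ocaml
(* Hypothetical extension class. Only clone, node and set_node are required
   by the description above; the visit counter is an invented extra. *)
class ['node] counting_extension =
  object
    (* The node this extension is attached to; unset until set_node is called. *)
    val mutable node = (None : 'node option)
    val mutable visits = 0

    (* Copy the extension object, including its current field values. *)
    method clone = {< >}

    method node =
      match node with
      | Some n -> n
      | None -> failwith "extension is not attached to a node"

    method set_node (n : 'node) = node <- Some n

    (* Additional, application-specific methods. *)
    method visit = visits <- visits + 1
    method visits = visits
  end
```

A clone starts with the same state as the original but evolves independently afterwards, which is presumably what is wanted when an extension is duplicated together with its node.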

Additionally, the following modules play a role:

  • Pxp_yacc: Here the main parsing functions such as parse_document_entity are located. Some additional types and functions allow the parser to be configured in a non-standard way.

  • Pxp_types: This is a collection of basic types and exceptions.

There are some further modules that are needed internally but are not part of the API.

Let the document to be parsed be stored in a file called doc.xml. The parsing process is started by calling the function

val parse_document_entity : config -> source -> 'ext spec -> 'ext document
defined in the module Pxp_yacc. The first argument specifies some global properties of the parser; it is recommended to start with the default_config. The second argument determines where the document to be parsed comes from; this may be a file, a channel, or an entity ID. To parse doc.xml, it is sufficient to pass from_file "doc.xml".

The third argument passes the object specification to use. Roughly speaking, it determines which classes implement the node objects of which element types, and which extensions are to be used. The 'ext polymorphic variable is the type of the extension. For the moment, let us simply pass default_spec as this argument, and ignore it.

So the following expression parses doc.xml:

open Pxp_yacc
let d = parse_document_entity default_config (from_file "doc.xml") default_spec
Note that default_config implies that warnings are collected but not printed. Errors raise one of the exceptions defined in Pxp_types; to get readable errors and warnings, catch the exceptions as follows:
class warner =
  object 
    method warn w =
      print_endline ("WARNING: " ^ w)
  end
;;

try
  let config = { default_config with warner = new warner } in
  let d = parse_document_entity config (from_file "doc.xml") default_spec
  in
    ...
with
   e ->
     print_endline (Pxp_types.string_of_exn e)
Now d is an object of the document class. If you want the node tree, you can get the root element by
let root = d # root
and if you would rather access the DTD, you can get it by
let dtd = d # dtd
As it is more interesting, let us investigate the node tree now. Given the root element, it is possible to recursively traverse the whole tree. The children of a node n are returned by the method sub_nodes, and the type of a node is returned by node_type. This function traverses the tree, and prints the type of each node:
let rec print_structure n =
  let ntype = n # node_type in
  match ntype with
    T_element name ->
      print_endline ("Element of type " ^ name);
      let children = n # sub_nodes in
      List.iter print_structure children
  | T_data ->
      print_endline "Data"
  | _ ->
      (* Other node types are not possible unless the parser is configured
         differently.
       *)
      assert false
You can call this function by
print_structure root
The type returned by node_type is either T_element name or T_data. The name of the element type is the string included in the angle brackets. Note that only elements have children; data nodes are always leaves of the tree.
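
The same recursion can return a value instead of printing. The sketch below collects the element type names in document order. To keep it runnable without PXP, it defines stand-in types that mimic only the two methods used here (node_type and sub_nodes); the helper functions element, data, and element_names are hypothetical and not part of the library.

```ocaml
(* Stand-in types: they mimic only what this example needs and are not
   PXP's real definitions. *)
type node_type = T_element of string | T_data

class type node = object
  method node_type : node_type
  method sub_nodes : node list
end

(* Hypothetical constructors for building a small test tree by hand. *)
let element name children : node = object
  method node_type = T_element name
  method sub_nodes = children
end

let data () : node = object
  method node_type = T_data
  method sub_nodes = []
end

(* Collect the names of all elements, parents before their children. *)
let rec element_names (n : node) : string list =
  match n # node_type with
  | T_element name ->
      name :: List.concat (List.map element_names (n # sub_nodes))
  | T_data ->
      (* Data nodes are leaves and contribute no element name. *)
      []
```

Because the function only calls node_type and sub_nodes, the same body should carry over to real PXP nodes, with the catch-all case of print_structure added for the other node types.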

There are some more methods in order to access a parsed node tree:

  • n # parent: Returns the parent node, or raises Not_found if the node is already the root.

  • n # root: Returns the root of the node tree.

  • n # attribute a: Returns the value of the attribute with name a. The method returns a value for every declared attribute, independently of whether the attribute instance is defined or not. If the attribute is not declared, Not_found will be raised. (In well-formedness mode, every existing attribute is considered as being implicitly declared with type CDATA, so you will get either Value s or an exception Not_found.)

    The following return values are possible: Value s, Valuelist sl, and Implied_value. The first two value types indicate that the attribute value is available, either because there is a definition a="value" in the XML text, or because there is a default value (declared in the DTD). Implied_value is returned only if both the instance definition and the default declaration are missing.

    In the DTD, every attribute is typed. There are single-value types (CDATA, ID, IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes Value s back, where s is the normalized string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS) represent list values, and the parser splits the XML literal into several tokens and returns these tokens as Valuelist sl.

    Normalization means that entity references (the &name; tokens) and character references (&#number;) are replaced by the text they represent, and that white space characters are converted into plain spaces.

  • n # data: Returns the character data contained in the node. For data nodes, the meaning is obvious as this is the main content of data nodes. For element nodes, this method returns the concatenated contents of all inner data nodes.

    Note that entity references included in the text are resolved while they are being parsed; for example the text "a &lt;&gt; b" will be returned as "a <> b" by this method. Spaces of data nodes are always preserved. Newlines are preserved, but always converted to \n characters even if newlines are encoded as \r\n or \r. Normally you will never see two adjacent data nodes because the parser collapses all data material at one location into one node. (However, if you create your own tree or transform the parsed tree, it is possible to have adjacent data nodes.)

    Note that elements that do not allow #PCDATA as content will not have data nodes as children. This means that spaces and newlines, the only character material allowed for such elements, are silently dropped.
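
The three possible results of n # attribute can be handled with a single pattern match. The sketch below redeclares the variant locally so that it runs without PXP; the constructor names are taken from the text above, while the function describe is a hypothetical helper.

```ocaml
(* Local copy of the three cases named above, so the example is
   self-contained. *)
type att_value =
  | Value of string          (* single-value types such as CDATA or ID *)
  | Valuelist of string list (* list types such as IDREFS or NMTOKENS *)
  | Implied_value            (* no instance value and no default *)

(* Turn an attribute value into a human-readable description. *)
let describe = function
  | Value s -> Printf.sprintf "single value %S" s
  | Valuelist sl -> Printf.sprintf "list value [%s]" (String.concat "; " sl)
  | Implied_value -> "implied (no instance value, no default)"
```

Matching on all three constructors lets the compiler warn about any case that is forgotten, which is harder to achieve with equality tests such as = Value "1".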

For example, if the task is to print all contents of elements with type "valuable" whose attribute "priority" is "1", this function can help:
let rec print_valuable_prio1 n =
  let ntype = n # node_type in
  match ntype with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      let children = n # sub_nodes in
      List.iter print_valuable_prio1 children
  | _ ->
      assert false
You can call this function by:
print_valuable_prio1 root
If you like a DSSSL-like style, you can make the function process_children explicit:
let rec print_valuable_prio1 n =

  let process_children n =
    let children = n # sub_nodes in
    List.iter print_valuable_prio1 children 
  in

  let ntype = n # node_type in
  match ntype with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      process_children n
  | _ ->
      assert false
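Note that the guard n # attribute "priority" = Value "1" raises Not_found when the attribute is not declared, as described for the attribute method above. A more defensive test could look like the following sketch. It is self-contained: att_value is redeclared locally, and the argument lookup is a hypothetical stand-in for (fun a -> n # attribute a).

```ocaml
(* Local copy of the attribute value type, so the example runs without PXP. *)
type att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

(* Return true only if the attribute exists and has exactly the expected
   single value; an undeclared attribute counts as "no match". *)
let has_attribute_value lookup name expected =
  match lookup name with
  | Value s -> s = expected
  | Valuelist _ | Implied_value -> false
  | exception Not_found -> false
```

With such a helper, the guard could be written as when has_attribute_value (fun a -> n # attribute a) "priority" "1".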
Used this way, O'Caml serves as a simple "style-sheet language": You can form a big "match" expression to distinguish between all significant cases, and provide different reactions to different conditions. But this technique has limitations; the "match" expression tends to get larger and larger, and it is difficult to store intermediate values as there is only one big recursion. Alternatively, it is also possible to represent the various cases as classes, and to use dynamic method lookup to find the appropriate class. The next section explains this technique in detail.

Notes

[1]

Elements may also contain processing instructions. Unlike other document models, PXP separates processing instructions from the rest of the text and provides a second interface to access them (method pinstr). However, there is a parser option (enable_pinstr_nodes) which changes the behaviour of the parser such that extra nodes for processing instructions are included in the tree.

Furthermore, the tree normally does not contain nodes for XML comments; they are ignored by default. Again, there is an option (enable_comment_nodes) changing this.

[2]

Due to the typing system it is more or less impossible to derive recursive classes in O'Caml. To get around this, it is common practice to put the modifiable or extensible part of recursive objects into parallel objects.

This web site is published by Informatikbüro Gerd Stolpmann