
2.3. Class-based processing of the node tree

By default, the parsed node tree consists of objects of the same class; this is a good design as long as you only want to access selected parts of the document. For complex transformations, it may be better to use different classes for the objects describing different element types.

For example, if the DTD declares the element types a, b, and c, and if the task is to convert an arbitrary document into a printable format, the idea is to define for every element type a separate class that has a method print. The classes are eltype_a, eltype_b, and eltype_c, and every class implements print such that elements of the type corresponding to the class are converted to the output format.

The parser supports such a design directly. As it is impossible to derive recursive classes in O'Caml[1], the specialized element classes cannot be formed by simply inheriting from the built-in classes of the parser and adding methods for customized functionality. To get around this limitation, every node of the document tree is represented by two objects, one called "the node" and containing the recursive definition of the tree, one called "the extension". Every node object has a reference to the extension, and the extension has a reference to the node. The advantage of this model is that it is now possible to customize the extension without affecting the typing constraints of the recursive node definition.

Every extension must have the three methods clone, node, and set_node. The method clone creates a deep copy of the extension object and returns it; node returns the node object for this extension object; and set_node tells the extension object which node is associated with it. The method set_node is called automatically when the node tree is initialized. The following definition is a good starting point for these methods; clone usually needs to be refined when instance variables are added to the class:

class custom_extension =
  object (self)

    val mutable node = (None : custom_extension node option)

    method clone = {< >} 
    method node =
      match node with
          None ->
            assert false
        | Some n -> n
    method set_node n =
      node <- Some n

  end

This part of the extension is usually the same for all classes, so it is a good idea to make custom_extension the superclass of the further class definitions. Continuing the example above, we can define the element type classes as follows:

class virtual custom_extension =
  object (self)
    ... clone, node, set_node defined as above ...

    method virtual print : out_channel -> unit
  end

class eltype_a =
  object (self)
    inherit custom_extension
    method print ch = ...
  end

class eltype_b =
  object (self)
    inherit custom_extension
    method print ch = ...
  end

class eltype_c =
  object (self)
    inherit custom_extension
    method print ch = ...
  end
The method print can now be implemented for every element type separately. Note that you get the associated node by invoking
self # node
and you get the extension object of a node n by writing
n # extension
It is guaranteed that
self # node # extension == self
always holds.

Here are sample definitions of the print methods:

class eltype_a =
  object (self)
    inherit custom_extension
    method print ch = 
      (* Nodes <a>...</a> are only containers: *)
      output_string ch "(";
      List.iter
        (fun n -> n # extension # print ch)
        (self # node # sub_nodes);
      output_string ch ")";
  end

class eltype_b =
  object (self)
    inherit custom_extension
    method print ch =
      (* Print the value of the CDATA attribute "print": *)
      match self # node # attribute "print" with
        Value s       -> output_string ch s
      | Implied_value -> output_string ch "<missing>"
      | Valuelist l   -> assert false   
                         (* not possible because the att is CDATA *)
  end

class eltype_c =
  object (self)
    inherit custom_extension
    method print ch = 
      (* Print the contents of this element: *)
      output_string ch (self # node # data)
  end

class null_extension =
  object (self)
    inherit custom_extension
    method print ch = assert false
  end
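
The sample classes above only compile together with PXP's node class. To experiment with the node/extension pairing in isolation, the following self-contained model may help. It is only a sketch: toy_node, toy_extension, container, leaf, make_node and render are hypothetical stand-ins invented for this illustration, not classes or functions from Pxp_document, and output goes to a Buffer.t instead of an out_channel so that the result can be inspected.

```ocaml
(* Hypothetical model of the node/extension pairing; none of these
   names belong to PXP. *)

(* The node holds the recursive tree structure and is parameterized
   over the extension type. *)
class ['ext] toy_node (ext : 'ext) =
  object
    val extension : 'ext = ext
    val mutable children : 'ext toy_node list = []
    method extension = extension
    method sub_nodes = children
    method append (n : 'ext toy_node) = children <- children @ [n]
  end

(* The extension carries the customizable behaviour. *)
class virtual toy_extension =
  object
    val mutable node : toy_extension toy_node option = None
    method clone = {< >}
    method node =
      match node with
      | None -> assert false   (* set_node has not been called yet *)
      | Some n -> n
    method set_node n = node <- Some n
    method virtual print : Buffer.t -> unit
  end

(* A container prints its children in parentheses. *)
class container =
  object (self)
    inherit toy_extension
    method print buf =
      Buffer.add_char buf '(';
      List.iter (fun n -> n # extension # print buf) (self # node # sub_nodes);
      Buffer.add_char buf ')'
  end

(* A leaf prints a fixed text. *)
class leaf (text : string) =
  object
    inherit toy_extension
    method print buf = Buffer.add_string buf text
  end

(* Create a node and link it with its extension in both directions. *)
let make_node (ext : toy_extension) : toy_extension toy_node =
  let n = new toy_node ext in
  ext # set_node n;
  n

let root = make_node (new container)

let () =
  root # append (make_node (new leaf "x"));
  root # append (make_node (new leaf "y"))

(* Print the tree below n into a string. *)
let render (n : toy_extension toy_node) =
  let buf = Buffer.create 16 in
  n # extension # print buf;
  Buffer.contents buf
```

In this model, render root yields "(xy)", and root # extension # node == root holds because make_node links both objects. The node class itself is never subclassed; only the extension classes differ.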

The remaining task is to configure the parser such that these extension classes are actually used. Here another problem arises: it is not possible to dynamically select the class of an object to be created. As a workaround, PXP allows the user to specify exemplar objects for the various element types; instead of creating the nodes of the tree by applying the new operator, the nodes are produced by duplicating the exemplars. As object duplication preserves the class of the object, one can create fresh objects of every class for which an exemplar has previously been registered.
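
The exemplar idea can be tried out without the parser. The sketch below assumes two hypothetical classes, elem and elem_b, and a hypothetical helper create; none of them are part of PXP. It uses Oo.copy from the standard library as the duplication primitive, which may differ from what PXP does internally.

```ocaml
(* Hypothetical classes for illustration only; not part of PXP. *)
class elem =
  object
    val mutable text = ""
    method set_text s = text <- s
    method text = text
    method render = "[" ^ text ^ "]"
  end

class elem_b =
  object
    inherit elem
    method! render = "<b>" ^ text ^ "</b>"
  end

(* One empty exemplar per element type, keyed by the type name. *)
let exemplars : (string, elem) Hashtbl.t = Hashtbl.create 8

let () =
  Hashtbl.add exemplars "plain" (new elem);
  Hashtbl.add exemplars "b" (new elem_b :> elem)

(* Create a fresh object by copying the registered exemplar.  The copy
   has the same class as the exemplar, so the class is effectively
   chosen at run time by the key. *)
let create eltype text =
  let fresh = Oo.copy (Hashtbl.find exemplars eltype) in
  fresh # set_text text;
  fresh
```

Here create "b" "hi" returns an object whose render method is the one from elem_b, although the code never applies new at that point. The exemplar itself stays empty.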

Exemplars are meant as objects without contents; the only interesting thing about them is that they are instances of a certain class. An exemplar for an element node can be created by:

let element_exemplar = new element_impl extension_exemplar
And a data node exemplar is created by:
let data_exemplar = new data_impl extension_exemplar
The classes element_impl and data_impl are defined in the module Pxp_document. The constructors initialize the fresh objects as empty objects, i.e. without children, without data contents, and so on. The extension_exemplar is the initial extension object the exemplars are associated with.

Once the exemplars are created and stored somewhere (e.g. in a hash table), you can take an exemplar and create a concrete instance (with contents) by duplicating it. As a user of the parser you are normally not concerned with this, because it is part of the internal logic of the parser. As background knowledge, however, it is worth mentioning that the two methods create_element and create_data duplicate the exemplar for which they are invoked, apply further modifications to the clone, and finally return the new object. Moreover, the extension object is copied, too, and the new node object is associated with the fresh extension object. This is the reason why every extension object must have a clone method.
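
The duplication step can be modelled as follows. This is a simplified sketch, not PXP's implementation: toy_node, duplicate_as, tagged_extension and this create_element function are hypothetical, and the real create_element method takes more arguments and does more work. The sketch also shows one situation in which clone needs to be refined: an extension that owns a mutable hash table would otherwise share it with all of its clones.

```ocaml
(* Hypothetical stand-in for a node class; not Pxp_document.element_impl. *)
class ['ext] toy_node (ext : 'ext) =
  object
    val extension : 'ext = ext
    val name = ""
    method extension = extension
    method name = name
    (* Duplicate this node, replacing the name and the extension. *)
    method duplicate_as (new_name : string) (new_ext : 'ext) =
      {< name = new_name; extension = new_ext >}
  end

class tagged_extension =
  object
    val mutable node : tagged_extension toy_node option = None
    (* Mutable state that must not be shared between clones. *)
    val seen : (string, unit) Hashtbl.t = Hashtbl.create 4
    (* A plain {< >} would copy only the reference to the table, so the
       table is copied explicitly. *)
    method clone = {< node = None; seen = Hashtbl.copy seen >}
    method node =
      match node with
      | None -> assert false
      | Some n -> n
    method set_node n = node <- Some n
    method remember key = Hashtbl.replace seen key ()
    method knows key = Hashtbl.mem seen key
  end

(* Duplicate the exemplar, clone its extension, and link the two copies. *)
let create_element (exemplar : tagged_extension toy_node) (name : string) =
  let ext = exemplar # extension # clone in
  let n = exemplar # duplicate_as name ext in
  ext # set_node n;
  n

let exemplar = new toy_node (new tagged_extension)
let a1 = create_element exemplar "a"
let a2 = create_element exemplar "a"
let () = a1 # extension # remember "k"
```

After these definitions, a1 # extension # knows "k" is true, while a2 and the exemplar are unaffected because every clone owns its own table.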

The configuration of the set of exemplars is passed to the parse_document_entity function as the third argument. In our example, this argument can be set up as follows:

let spec =
  make_spec_from_alist
    ~data_exemplar:            (new data_impl (new null_extension))
    ~default_element_exemplar: (new element_impl (new null_extension))
    ~element_alist:
       [ "a",  new element_impl (new eltype_a);
         "b",  new element_impl (new eltype_b);
         "c",  new element_impl (new eltype_c);
       ]
    ()

The ~element_alist argument defines the mapping from element types to exemplars as an associative list. The argument ~data_exemplar specifies the exemplar for data nodes, and the ~default_element_exemplar is used whenever the parser finds an element type for which the associative list does not define an exemplar.
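
The fallback rule for element types can be pictured with a small model. The record type toy_spec and the function element_exemplar_for below are hypothetical and only mirror the three arguments described above; the specification value that make_spec_from_alist really returns is not shown in this section and may be organized differently.

```ocaml
(* Hypothetical model of a specification; not PXP's actual spec type. *)
type 'exemplar toy_spec = {
  data_exemplar : 'exemplar;
  default_element_exemplar : 'exemplar;
  element_alist : (string * 'exemplar) list;
}

(* Look up the exemplar for an element type, falling back to the
   default exemplar when the associative list has no entry. *)
let element_exemplar_for spec eltype =
  match List.assoc_opt eltype spec.element_alist with
  | Some ex -> ex
  | None -> spec.default_element_exemplar

(* Strings stand in for exemplar objects to keep the sketch short. *)
let toy =
  { data_exemplar = "data";
    default_element_exemplar = "default";
    element_alist = [ "a", "exemplar for a"; "b", "exemplar for b" ] }
```

With this model, element_exemplar_for toy "a" selects the registered exemplar, while an unregistered type such as "z" falls back to the default.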

The configuration is now complete. You can still use the same parsing functions; only the initialization is a bit different. For example, call the parser by:

let d = parse_document_entity default_config (from_file "doc.xml") spec

Note that the resulting document d has a usable type; in particular, the print method we added is visible. So you can print your document by:

d # root # extension # print stdout

This object-oriented approach looks rather complicated, mostly because it has to work around some restrictions of the strict type system of O'Caml. Some auxiliary concepts such as extensions were needed, but their practical consequences are minor. The next section explains one of the examples of the distribution, a converter from readme documents to HTML.

Notes

[1]

The problem is that the subclass is usually not a subtype in this case, because the subtyping rule of O'Caml is contravariant in the argument types of methods.

This web site is published by Informatikbüro Gerd Stolpmann