

The readme processor

The task of the readme processor is to convert a document conforming to the XML DTD "readme.dtd" into an HTML document or a text document. This example especially demonstrates how to use node extensions to add custom methods to nodes (see Intro_extensions), and how to use the object-oriented feature of late binding so that every node type behaves differently.

Note that the converter code dates back to 1999. Nowadays I would probably write it as a purely functional transformer. This task is now left to the reader...

The readme DTD

The motivation for readme was that I often wrote two versions of files such as README and INSTALL explaining aspects of a distributed software archive: one version was ASCII-formatted, the other was written in HTML. Maintaining both versions meant double the work, and changes made to one version could be forgotten in the other. To improve this situation I invented the readme DTD, which allows me to maintain only one source written as an XML document, and to generate both the ASCII and the HTML version from it.

In this section, I explain only the DTD. The readme DTD is contained in the PXP distribution together with the two converters to produce ASCII and HTML.

The documents have a simple structure: there are up to three levels of nested sections, paragraphs, item lists, footnotes, hyperlinks, and text emphasis. The outermost element usually has the type readme; it is declared by

<!ELEMENT readme (sect1+)>
<!ATTLIST readme
          title CDATA #REQUIRED>

This means that this element contains one or more sections of the first level (element type sect1), and that the element has a required attribute title containing character data (CDATA). Note that readme elements must not contain text data.

The three levels of sections are declared as follows:

<!ELEMENT sect1 (title,(sect2|p|ul)+)>

<!ELEMENT sect2 (title,(sect3|p|ul)+)>

<!ELEMENT sect3 (title,(p|ul)+)>

Every section has a title element as first subelement. After the title an arbitrary but non-empty sequence of inner sections, paragraphs and item lists follows. Note that the inner sections must belong to the next higher section level; sect3 elements must not contain inner sections because there is no next higher level.
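
The nesting rules can also be mirrored in OCaml types, which makes the "no sections inside sect3" rule visible at a glance. The following is only a sketch: the type and field names are hypothetical, inline markup is reduced to plain strings, and the "one or more" constraint of the content model is not enforced by the list types.

```ocaml
(* Hypothetical OCaml mirror of the three section declarations.
   Inline text markup is simplified to a plain string here. *)
type block =
  | P of string            (* paragraph *)
  | Ul of block list list  (* item list; each item holds blocks *)

(* sect3 has no constructor for inner sections, just like the DTD. *)
type sect3 = { title3 : string; body3 : block list }

type sect2_item = Sect3 of sect3 | Block2 of block
type sect2 = { title2 : string; body2 : sect2_item list }

type sect1_item = Sect2 of sect2 | Block1 of block
type sect1 = { title1 : string; body1 : sect1_item list }

(* Count the sections of all three levels contained in one sect1,
   including the sect1 itself. *)
let count_sections (s : sect1) : int =
  let in_sect2 (s2 : sect2) =
    1 + List.length
          (List.filter
             (function Sect3 _ -> true | Block2 _ -> false)
             s2.body2)
  in
  List.fold_left
    (fun acc item ->
       match item with
       | Sect2 s2 -> acc + in_sect2 s2
       | Block1 _ -> acc)
    1 s.body1

let example =
  { title1 = "Usage";
    body1 =
      [ Block1 (P "intro");
        Sect2 { title2 = "Options";
                body2 =
                  [ Block2 (P "text");
                    Sect3 { title3 = "Details"; body3 = [ P "more" ] } ] } ] }
```

Trying to put a Sect2 value into body3 is rejected by the type checker, which corresponds to a validation error in the XML world.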

Obviously, all three declarations allow paragraphs (p) and item lists (ul). The definition can be simplified at this point by using a parameter entity:

<!ENTITY % p.like "p|ul">

<!ELEMENT sect1 (title,(sect2|%p.like;)+)>

<!ELEMENT sect2 (title,(sect3|%p.like;)+)>

<!ELEMENT sect3 (title,(%p.like;)+)>

Here, the entity is nothing but a macro abbreviating the same sequence of declarations; if new elements on the same level as p and ul are later added, it is sufficient only to change the entity definition. Note that there are some restrictions on the usage of entities in this context; most important, entities containing a left parenthesis must also contain the corresponding right parenthesis.

Note that the entity is a parameter entity, i.e. the ENTITY declaration contains a percent sign, and the entity is referred to by %name;. This kind of entity must be used to abbreviate parts of the DTD; the general entities declared without percent sign and referred to as &name; are not allowed in this context.

The title element specifies the title of the section in which it occurs. The title is given as character data, optionally interspersed with line breaks (br):

<!ELEMENT title (#PCDATA|br)*>

Compared with the title attribute of the readme element, this element allows inner markup (i.e. br) while attribute values do not: it is an error if an attribute value contains the left angle bracket < literally, so it is impossible to include inner elements in an attribute value.

The paragraph element p has a structure similar to title, but it allows more inner elements:

<!ENTITY % text "br|code|em|footnote|a">

<!ELEMENT p (#PCDATA|%text;)*>

Line breaks do not have inner structure, so they are declared as being empty:

<!ELEMENT br EMPTY>

This means that really nothing is allowed within br; you must always write <br></br> or abbreviated <br/>.

Code samples should be marked up by the code tag; emphasized text can be indicated by em:

<!ELEMENT code (#PCDATA)>

<!ELEMENT em (#PCDATA|%text;)*>

It is a design decision by the author of the DTD that code elements must not contain further markup while em elements may.

Unordered lists simply consist of one or more list items, and a list item may contain paragraph-level material:

<!ELEMENT ul (li+)>

<!ELEMENT li (%p.like;)*>

Footnotes are described by the text of the note; this text may contain text-level markup. There is no mechanism to describe the numbering scheme of footnotes, or to specify how footnote references are printed.

<!ELEMENT footnote (#PCDATA|%text;)*>

Hyperlinks are written as in HTML. The anchor tag contains the text describing where the link points to, and the href attribute is the pointer (as URL). There is no way to describe locations of "hash marks". If the link refers to another readme document, the attribute readmeref should be used instead of href. The reason is that the converted document has usually a different system identifier (file name), and the link to a converted document must be converted, too.

<!ELEMENT a (#PCDATA)>
<!ATTLIST a
          href      CDATA #IMPLIED
          readmeref CDATA #IMPLIED
>

Note that although it is only sensible to specify one of the two attributes, the DTD has no means to express this restriction.
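
Since the DTD cannot express the restriction, an application has to check it itself. A minimal sketch of such a check follows; the function name link_target and the use of option values for the two attributes are assumptions made for illustration, and the ".html" suffix follows the rule used by the HTML converter described later in this document.

```ocaml
(* Hypothetical helper: [href] and [readmeref] stand for the two optional
   attribute values; None means that the attribute was omitted. *)
let link_target ~href ~readmeref =
  match href, readmeref with
  | Some _, Some _ ->
      Error "a: specify either href or readmeref, not both"
  | Some url, None ->
      Ok (Some url)                 (* ordinary URL, used as it is *)
  | None, Some doc ->
      Ok (Some (doc ^ ".html"))     (* link to the converted document *)
  | None, None ->
      Ok None                       (* anchor without target *)
```

Returning a result value keeps the decision about how to report the error (exception, warning, exit code) with the caller.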

So far the DTD. Finally, here is a sample document for it:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd">
<readme title="How to use the readme converters">
  <sect1>
    <title>Usage</title>
    <p>
      The <em>readme</em> converter is invoked on the command line by:
    </p>
    <p>
      <code>readme [ -text | -html ] input.xml</code>
    </p>
    <p>
      Here is a list of options:
    </p>
    <ul>
      <li>
        <p><code>-text</code>:
          specifies that ASCII output should be produced</p>
      </li>
      <li>
        <p><code>-html</code>:
           specifies that HTML output should be produced</p>
      </li>
    </ul>
    <p>
      The input file must be given on the command line. The converted output is
      printed to <em>stdout</em>.
    </p>
  </sect1>
  <sect1>
    <title>Author</title>
    <p>
      The program has been written by
      <a href="">Gerd Stolpmann</a>.
    </p>
  </sect1>
</readme>

The readme converter to HTML

The converter from readme documents to HTML documents strictly follows the approach of defining one extension class per element type. The generated HTML code is structurally similar to the readme source; because of this, most elements can be converted in the following straightforward way: Given the input element

<e>content</e>

the conversion text is the concatenation of a computed prefix, the recursively converted content, and a computed suffix.
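
The prefix/content/suffix scheme can be sketched without any objects, in the purely functional style mentioned in the introduction. This is not the converter of this document: the node type below is a simplified stand-in for a parsed tree rather than the PXP node type, the function names are hypothetical, and character escaping is left out.

```ocaml
(* Simplified stand-in for a parsed document tree (not the PXP node type). *)
type node =
  | Text of string
  | Elem of string * node list

(* Prefix and suffix computed from the element type. The mappings follow
   those of the converter classes in this document (em becomes b, etc.). *)
let wrap = function
  | "p"  -> "<p>\n",  "\n</p>"
  | "em" -> "<b>\n",  "\n</b>"
  | "ul" -> "<ul>\n", "\n</ul>"
  | "li" -> "<li>\n", "\n</li>"
  | "br" -> "<br>\n", ""
  | _    -> "", ""   (* unknown elements contribute no markup *)

(* Conversion text = prefix ^ recursively converted content ^ suffix. *)
let rec to_html = function
  | Text s -> s   (* a real converter would escape s here *)
  | Elem (name, children) ->
      let prefix, suffix = wrap name in
      prefix ^ String.concat "" (List.map to_html children) ^ suffix
```

For example, to_html (Elem ("p", [Text "a "; Elem ("em", [Text "b"])])) yields the p start tag, the converted children, and the p end tag in that order. Footnotes do not fit into this function for the reason given in the next paragraph.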

Only one element type cannot be handled by this scheme: footnote. Footnotes are collected while they are found in the input text, and they are printed after the main text has been converted and printed.

We now comment on the source code of the converter.


open Pxp_types
open Pxp_document
open Pxp_dtd.Entity

Type declarations

class type footnote_printer =
  object
    method footnote_to_html : store_type -> out_channel -> unit
  end

and store_type =
  object
    method alloc_footnote : footnote_printer -> int
    method print_footnotes : out_channel -> unit
  end

For comments see the implementations below.

The class store of type store_type

The store is a container for footnotes. You can add a footnote by invoking alloc_footnote; the argument is an object of the class footnote_printer, the method returns the number of the footnote. The interesting property of a footnote is that it can be converted to HTML, so a footnote_printer is an object with a method footnote_to_html. The class footnote which is defined below has a compatible method footnote_to_html such that objects created from it can be used as footnote_printers.

The other method, print_footnotes, prints the footnotes as a definition list, and is typically invoked after the main material of the page has already been printed. Every item of the list is printed by footnote_to_html.

class store =
  object (self)

    val mutable footnotes = ( [] : (int * footnote_printer) list )
    val mutable next_footnote_number = 1

    method alloc_footnote n =
      let number = next_footnote_number in
      next_footnote_number <- number+1;
      footnotes <- footnotes @ [ number, n ];
      (* the number of the new footnote is the result *)
      number

    method print_footnotes ch =
      if footnotes <> [] then begin
        output_string ch "<hr align=left noshade=noshade width=\"30%\">\n";
        output_string ch "<dl>\n";
        List.iter
          (fun (_,n) -> 
             n # footnote_to_html (self : #store_type :> store_type) ch)
          footnotes;
        output_string ch "</dl>\n";
      end

  end
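
The allocate-now, print-later flow can be tried out in isolation with the following stand-alone sketch. It is not the class above: the class name simple_store is hypothetical, footnote bodies are plain strings instead of footnote_printer objects, and the output goes to a Buffer instead of an out_channel. It also collects the notes in reverse order and reverses them once when printing, which avoids the repeated list concatenation.

```ocaml
(* Stand-alone sketch: footnote bodies are plain strings here, and the
   output is appended to a Buffer. *)
class simple_store =
  object
    val mutable rev_notes = ([] : (int * string) list)   (* newest first *)
    val mutable next = 1

    (* Register a footnote and return its number. *)
    method alloc text =
      let number = next in
      next <- number + 1;
      rev_notes <- (number, text) :: rev_notes;
      number

    (* Append all collected footnotes in allocation order. *)
    method print buf =
      List.iter
        (fun (number, text) ->
           Buffer.add_string buf
             (Printf.sprintf "<dt>[%d]</dt><dd>%s</dd>\n" number text))
        (List.rev rev_notes)
  end
```

A caller allocates footnotes while converting the main text, keeps the returned numbers for the references, and invokes print once at the end of the page.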


The function escape_html

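This function converts the characters <, >, &, and " to their HTML representations. For example,

 escape_html "<>" = "&lt;&gt;" 

In addition, the characters @ and : are replaced by the numeric character references &#64; and &#58;. All other characters are left unchanged.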

let escape_html s =
  Str.global_substitute
    (Str.regexp "<\\|>\\|&\\|\"\\|@\\|:")
    (fun s ->
      match Str.matched_string s with
        "<" -> "&lt;"
      | ">" -> "&gt;"
      | "&" -> "&amp;"
      | "\"" -> "&quot;"
      | "@" -> "&#64;"
      | ":" -> "&#58;"
      | _ -> assert false)
    s

Note (of 2009): There is also the Ocamlnet function Netencoding.Html.encode one can use. It has a special XML mode.
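
If neither Str nor Ocamlnet should be linked, the same substitution can be written with the standard library only. The following sketch uses a Buffer and a single pass over the string; the name escape_html_buf is hypothetical and chosen only to keep it apart from the function above.

```ocaml
(* Dependency-free variant: one pass over the string, output collected
   in a Buffer. It replaces the same six characters as escape_html. *)
let escape_html_buf s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (function
      | '<' -> Buffer.add_string buf "&lt;"
      | '>' -> Buffer.add_string buf "&gt;"
      | '&' -> Buffer.add_string buf "&amp;"
      | '"' -> Buffer.add_string buf "&quot;"
      | '@' -> Buffer.add_string buf "&#64;"
      | ':' -> Buffer.add_string buf "&#58;"
      | c   -> Buffer.add_char buf c)
    s;
  Buffer.contents buf
```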

The virtual class shared

This virtual class is the abstract superclass of the extension classes shown below. It defines the standard methods clone, node, and set_node, and declares the type of the virtual method to_html. This method recursively traverses the whole element tree, and prints the converted HTML code to the output channel passed as second argument. The first argument is the reference to the global store object which collects the footnotes.

class virtual shared =
  object (self)

    (* --- default_ext --- *)

    val mutable node = (None : shared node option)

    method clone = {< >} 
    method node =
      match node with
          None ->
            assert false
        | Some n -> n
    method set_node n =
      node <- Some n

    (* --- virtual --- *)

    method virtual to_html : store -> out_channel -> unit

  end


For an introduction to extension classes see Intro_extensions.

The class only_data

This class defines to_html such that the character data of the current node is converted to HTML. Note that self is an extension object (of type Pxp_document.extension), self # node is the node object (of type Pxp_document.node), and self # node # data returns the character data of the node (see Pxp_document.node.data).

class only_data =
  object (self)
    inherit shared

    method to_html store ch =
      output_string ch (escape_html (self # node # data))

  end

The class readme

This class converts elements of type readme to HTML. Such an element is (by definition) always the root element of the document. First, the HTML header is printed; the title attribute of the element determines the title of the HTML page. Some aspects of the HTML page can be configured by setting certain parameter entities, for example the background color, the text color, and link colors. After the header, the body tag, and the headline have been printed, the contents of the page are converted by invoking to_html on all children of the current node (which is the root node). Then, the footnotes are appended to this by telling the global store object to print the footnotes. Finally, the end tags of the HTML pages are printed.

class readme =
  object (self)
    inherit shared

    method to_html store ch =
      (* output header *)
      output_string
        ch "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">";
      output_string
        ch "<!-- WARNING! This is a generated file, do not edit! -->\n";
      let title = 
        match self # node # attribute "title" with
            Value s -> s
          | _ -> assert false
      in
      (* optional parameter entities configure the page; a missing
         entity raises WF_error and falls back to the default *)
      let html_header =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:header")
        with WF_error _ -> "" in
      let html_trailer =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:trailer")
        with WF_error _ -> "" in
      let html_bgcolor =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:bgcolor")
        with WF_error _ -> "white" in
      let html_textcolor =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:textcolor")
        with WF_error _ -> "" in
      let html_alinkcolor =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:alinkcolor")
        with WF_error _ -> "" in
      let html_vlinkcolor =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:vlinkcolor")
        with WF_error _ -> "" in
      let html_linkcolor =
        try replacement_text 
            (self # node # dtd # par_entity "readme:html:linkcolor")
        with WF_error _ -> "" in
      let html_background =
        try replacement_text
            (self # node # dtd # par_entity "readme:html:background")
        with WF_error _ -> "" in

      output_string ch "<html><head><title>\n";
      output_string ch (escape_html title);
      output_string ch "</title></head>\n";
      output_string ch "<body ";
      (* print only those attributes that have a non-empty value *)
      List.iter
        (fun (name,value) ->
           if value <> "" then 
             output_string ch (name ^ "=\"" ^ escape_html value ^ "\" "))
        [ "bgcolor",    html_bgcolor;
          "text",       html_textcolor;
          "link",       html_linkcolor;
          "alink",      html_alinkcolor;
          "vlink",      html_vlinkcolor;
        ];
      output_string ch ">\n";
      output_string ch html_header;
      output_string ch "<h1>";
      output_string ch (escape_html title);
      output_string ch "</h1>\n";
      (* process main content: *)
      List.iter
        (fun n -> n # extension # to_html store ch)
        (self # node # sub_nodes);
      (* now process footnotes *)
      store # print_footnotes ch;
      (* trailer *)
      output_string ch html_trailer;
      output_string ch "</html>\n"

  end

This class is an example of how to access the value of an attribute: the value is determined by invoking self # node # attribute "title" (see Pxp_document.node.attribute). As this attribute has been declared as CDATA and as being required, the value always has the form Value s where s is the string value of the attribute. Attribute values have type Pxp_types.att_value.

You can also see how entity contents can be accessed. A parameter entity object can be looked up by self # node # dtd # par_entity "name" (see Pxp_dtd.dtd.par_entity), and by invoking Pxp_dtd.Entity.replacement_text the value of the entity is returned after inner parameter and character entities have been processed. Note that you must use Pxp_dtd.dtd.gen_entity instead of par_entity to access general entities.
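
The pattern "look up an entity, fall back to a default, and print only non-empty attributes" can be isolated as follows. This is a stand-alone sketch and not PXP code: a Hashtbl stands in for the parameter entities of the DTD, the helper names entity_or and body_attributes are hypothetical, and the escaping of attribute values is omitted.

```ocaml
(* Look up an entity in [entities] (a stand-in for the DTD), falling back
   to [default] when no entity of that name is declared. *)
let entity_or (entities : (string, string) Hashtbl.t) ~default name =
  match Hashtbl.find_opt entities name with
  | Some text -> text
  | None -> default

(* Render name="value" pairs, skipping attributes whose value is empty.
   Escaping of the value is omitted in this sketch. *)
let body_attributes pairs =
  String.concat ""
    (List.map
       (fun (name, value) ->
          if value = "" then "" else name ^ "=\"" ^ value ^ "\" ")
       pairs)
```

With such a helper the eight nearly identical try ... with blocks of the class collapse into one call per entity; in the real converter the helper would wrap replacement_text and par_entity and catch WF_error.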

The classes section, sect1, sect2, and sect3

As the conversion process is very similar, the conversion classes of the three section levels are derived from the more general section class. The HTML code of the section levels only differs in the type of the headline, and because of this the classes describing the section levels can be computed by replacing the class argument the_tag of section by the HTML name of the headline tag.

class section the_tag =
  object (self)
    inherit shared

    val tag = the_tag

    method to_html store ch =
      let sub_nodes = self # node # sub_nodes in
      match sub_nodes with
          title_node :: rest ->
            output_string ch ("<" ^ tag ^ ">\n");
            title_node # extension # to_html store ch;
            output_string ch ("\n</" ^ tag ^ ">");
            (* convert the section body, i.e. everything after the title *)
            List.iter
              (fun n -> n # extension # to_html store ch)
              rest
        | _ ->
            assert false

  end

class sect1 = section "h1"
class sect2 = section "h3"
class sect3 = section "h4"

Section elements are converted to HTML by printing a headline and then converting the contents of the element recursively. More precisely, the first sub-element is always a title element, and the other elements are the contents of the section. This structure is declared in the DTD, and it is guaranteed that the document matches the DTD. Because of this the title node can be separated from the rest without any checks.

Both the title node, and the body nodes are then converted to HTML by calling to_html on them.
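
The technique of deriving several classes from one class with an argument can be tried out independently of PXP. In the following sketch the class names heading, level1, level2, and level3 are hypothetical, and the title and body are assumed to be already converted strings.

```ocaml
(* Stand-alone sketch: the title and body are already-converted strings. *)
class heading (the_tag : string) =
  object
    val tag = the_tag

    (* Headline first, then the body of the section. *)
    method render title body =
      "<" ^ tag ^ ">\n" ^ title ^ "\n</" ^ tag ^ ">" ^ body
  end

(* Each level is the general class applied to one fixed argument. *)
class level1 = heading "h1"
class level2 = heading "h3"
class level3 = heading "h4"
```

All three classes have the same object type, so objects of different levels can be stored in one list and rendered by the same call; which headline tag is printed is decided by the object itself.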

The classes map_tag, p, em, ul, and li

Several element types are converted to HTML by simply mapping them to corresponding HTML element types. The class map_tag implements this, and the class argument the_target_tag determines the tag name to map to. The output consists of the start tag, the recursively converted inner elements, and the end tag.

class map_tag the_target_tag =
  object (self)
    inherit shared

    val target_tag = the_target_tag

    method to_html store ch =
      output_string ch ("<" ^ target_tag ^ ">\n");
      List.iter
        (fun n -> n # extension # to_html store ch)
        (self # node # sub_nodes);
      output_string ch ("\n</" ^ target_tag ^ ">")

  end

class p = map_tag "p"
class em = map_tag "b"
class ul = map_tag "ul"
class li = map_tag "li"

The class br

Elements of type br are mapped to the same HTML type. Note that HTML forbids the end tag of br.

class br =
  object (self)
    inherit shared

    method to_html store ch =
      output_string ch "<br>\n";
      List.iter
        (fun n -> n # extension # to_html store ch)
        (self # node # sub_nodes)

  end

The class code

The code type is converted to a pre section (preformatted text). As the meaning of tabs is unspecified in HTML, tabs are expanded to spaces.

class code =
  object (self)
    inherit shared

    method to_html store ch =
      let data = self # node # data in
      (* convert tabs *)
      let l = String.length data in
      let rec preprocess i column =
        (* this is very inefficient but comprehensible: *)
        if i < l then
          match data.[i] with
              '\t' ->
                (* advance to the next multiple of 8 *)
                let n = 8 - (column mod 8) in
                String.make n ' ' ^ preprocess (i+1) (column + n)
            | '\n' ->
                "\n" ^ preprocess (i+1) 0
            | c ->
                String.make 1 c ^ preprocess (i+1) (column + 1)
        else
          ""
      in
      output_string ch "<p><pre>";
      output_string ch (escape_html (preprocess 0 0));
      output_string ch "</pre></p>"

  end
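
For long code samples the recursive string concatenation above becomes slow and is not tail-recursive. A possible replacement, sketched here with the standard library only, collects the output in a Buffer; the function name expand_tabs is hypothetical, and the tab width of 8 is taken from the method above.

```ocaml
(* Expand tabs to the next multiple of 8 columns in one pass. The column
   counter is reset at every newline. *)
let expand_tabs data =
  let buf = Buffer.create (String.length data) in
  let column = ref 0 in
  String.iter
    (fun c ->
       match c with
       | '\t' ->
           let n = 8 - (!column mod 8) in
           Buffer.add_string buf (String.make n ' ');
           column := !column + n
       | '\n' ->
           Buffer.add_char buf '\n';
           column := 0
       | c ->
           Buffer.add_char buf c;
           incr column)
    data;
  Buffer.contents buf
```

Inside the class, escape_html (expand_tabs data) would take the place of escape_html (preprocess 0 0).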

The class a

Hyperlinks, expressed by the a element type, are converted to the HTML a type. If the target of the hyperlink is given by href, the URL of this attribute can be used directly. Alternatively, the target can be given by readmeref in which case the ".html" suffix must be added to the file name.

Note that within a only #PCDATA is allowed, so the contents can be converted directly by applying escape_html to the character data contents.

class a =
  object (self)
    inherit shared

    method to_html store ch =
      output_string ch "<a ";
      let href =
        match self # node # attribute "href" with
            Value v -> escape_html v
          | Valuelist _ -> assert false
          | Implied_value ->
              begin match self # node # attribute "readmeref" with
                  Value v -> escape_html v ^ ".html"
                | Valuelist _ -> assert false
                | Implied_value ->
                    (* neither href nor readmeref is given *)
                    ""
              end
      in
      if href <> "" then
        output_string ch ("href=\""  ^ href ^ "\"");
      output_string ch ">";
      output_string ch (escape_html (self # node # data));
      output_string ch "</a>"

  end

The class footnote

The footnote class has two methods: to_html to convert the footnote reference to HTML, and footnote_to_html to convert the footnote text itself.

The footnote reference is converted to a local hyperlink; more precisely, to two anchor tags which are connected with each other. The text anchor points to the footnote anchor, and the footnote anchor points to the text anchor.

The footnote must be allocated in the store object. By allocating the footnote, you get the number of the footnote, and the text of the footnote is stored until the end of the HTML page is reached, when the footnotes can be printed. The to_html method simply stores the object itself, such that the footnote_to_html method is invoked on the same object that encountered the footnote.

The to_html method only allocates the footnote and prints the reference anchor, but it neither prints nor converts the contents of the note. This is deferred until the footnotes actually get printed, i.e. the recursive call of to_html on the sub nodes is done by footnote_to_html.

Note that this technique does not work if you make another footnote within a footnote; the second footnote gets allocated but not printed.

class footnote =
  object (self)
    inherit shared

    val mutable footnote_number = 0

    method to_html store ch =
      let number = 
        store # alloc_footnote (self : #shared :> footnote_printer) in
      let foot_anchor = 
        "footnote" ^ string_of_int number in
      let text_anchor =
        "textnote" ^ string_of_int number in
      footnote_number <- number;
      output_string ch ( "<a name=\"" ^ text_anchor ^ "\" href=\"#" ^ 
                         foot_anchor ^ "\">[" ^ string_of_int number ^ 
                         "]</a>" )

    method footnote_to_html store ch =
      (* prerequisite: we are in a definition list <dl>...</dl> *)
      let foot_anchor = 
        "footnote" ^ string_of_int footnote_number in
      let text_anchor =
        "textnote" ^ string_of_int footnote_number in
      output_string ch ("<dt><a name=\"" ^ foot_anchor ^ "\" href=\"#" ^ 
                        text_anchor ^ "\">[" ^ string_of_int footnote_number ^ 
                        "]</a></dt>\n<dd>");
      (* now convert the contents of the note *)
      List.iter
        (fun n -> n # extension # to_html store ch)
        (self # node # sub_nodes);
      output_string ch ("\n</dd>")

  end

The specification of the document model

This code sets up the hash table that connects element types with the exemplars of the extension classes that convert the elements to HTML. See How to bind extension classes to element types for comments, and Pxp_document.make_spec_from_alist for the function definition.

let tag_map =
  make_spec_from_alist
    ~data_exemplar:(new data_impl (new only_data))
    ~default_element_exemplar:(new element_impl (new no_markup))
    ~element_alist:
      [ "readme", (new element_impl (new readme));
        "sect1",  (new element_impl (new sect1));
        "sect2",  (new element_impl (new sect2));
        "sect3",  (new element_impl (new sect3));
        "title",  (new element_impl (new no_markup));
        "p",      (new element_impl (new p));
        "br",     (new element_impl (new br));
        "code",   (new element_impl (new code));
        "em",     (new element_impl (new em));
        "ul",     (new element_impl (new ul));
        "li",     (new element_impl (new li));
        "footnote", (new element_impl (new footnote : #shared :> shared));
        "a",      (new element_impl (new a));
      ]
    ()

The readme converter to ASCII

This converter is quite similar to the HTML converter, and not presented here. The source code is part of the PXP source tarball, however.

The main program

open Pxp_types
open Pxp_document
open Pxp_tree_parser

let rec print_error e =
  prerr_endline(string_of_exn e)

let run f a =
  try f a with
      e -> print_error e

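(* The module names To_html and To_text are assumptions: they presume
   that the two converters live in files to_html.ml and to_text.ml. *)

let convert_to_html filename =
  let document =
    parse_document_entity
      { default_config with encoding = `Enc_iso88591 }
      (from_file filename)
      To_html.tag_map
  in
  let root = document # root in
  let store = new To_html.store in
  root # extension # to_html store stdout

let convert_to_text filename =
  let document =
    parse_document_entity
      default_config
      (from_file filename)
      To_text.tag_map
  in
  let root = document # root in
  let store = new To_text.store in
  let box = new To_text.box 79 79 in
  root # extension # to_box store box;
  box # output 0 0 stdout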

let main() =
  let want_html = ref false in
  let want_text = ref false in
  let filename = ref None in
  Arg.parse
      [ "-html", Arg.Set want_html, 
              "  convert file to html";
        "-text", Arg.Set want_text,
              "  convert file to text";
      ]
      (fun s -> 
         match !filename with
             None -> filename := Some s
           | Some _ ->
               raise (Arg.Bad "Multiple arguments not allowed."))
      "usage: readme [ -text | -html ] input.xml >output";
  let fn =
    match !filename with
        None -> 
          prerr_endline "readme: no input";
          exit 1
      | Some s -> s
  in
  (* exactly one of the two output formats must be selected *)
  match !want_html, !want_text with
      true, false ->
        run convert_to_html fn
    | false, true ->
        run convert_to_text fn
    | _ ->
        prerr_endline ("readme: Please select exactly one output format")

let () =
  main()

This web site is published by Informatikbüro Gerd Stolpmann