******************************************************************************
The Preprocessor for PXP
******************************************************************************



Since PXP-1.1.95, there is a preprocessor as part of the PXP distribution. It
allows you to compose XML trees and event lists dynamically, which is very
handy for writing XML transformations.

To enable the preprocessor, compile your source files as in: 

ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...

The package pxp-pp contains the preprocessor. The -syntax option enables
camlp4, on which the preprocessor is based. It is also possible to use it
together with the revised syntax; use "-syntax camlp4r" in this case.

Important: Up to version 1.0.4, findlib (ocamlfind) has a problem with the 
definition for pxp-pp. There is an easy workaround: Use "-syntax camlp4o,byte".

In the toploop, type 

ocaml
# #use "topfind";;
# #camlp4o;;
# #require "pxp-pp";;
# #require "pxp";;



The preprocessor defines the following new syntax notations, explained below in 
detail: 

<:pxp_charset< CHARSET_DECL >>
<:pxp_tree< EXPR >>
<:pxp_vtree< EXPR >>
<:pxp_evlist< EXPR >>
<:pxp_evpull< EXPR >>
<:pxp_text< TEXT >>

The basic notation is "pxp_tree" which creates a tree of PXP document nodes as 
described in EXPR. "pxp_vtree" is the variant where the tree is immediately 
validated. "pxp_evlist" creates a list of PXP events instead of nodes, useful 
together with the event-based parser. "pxp_evpull" is a variation of the 
latter: Instead of an event list an event generator is created that works like 
a pull parser.

The "pxp_charset" notation only configures the character sets to assume. 
Finally, "pxp_text" is a notation for string literals.

------------------------------------------------------------------------------
Creating constant XML
------------------------------------------------------------------------------

The following examples are all written for "pxp_tree". You can also use one of 
the other XML composers instead, but see the notes below.

In order to use "pxp_tree", you must define two variables in the environment: 
"spec" and "dtd": 

let spec = Pxp_tree_parser.default_spec;;
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

These variables occur in the code generated by the preprocessor. The "dtd" 
variable is the DTD object. Note that you need it even in well-formedness mode 
(validation turned off). The "spec" variable controls which classes are 
instantiated as node representation (see PXP manual).

Now you can create XML trees like in 

let book = 
  <:pxp_tree< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

As you can see, the syntax is XML-like but not really XML. (Many ideas are
borrowed from CDUCE, by the way.) In particular, there are start tags like
<title> but no end tags. Instead, square brackets denote the children of an
XML element. Furthermore, character data must be put into double quotes.

You may ask why the well-known XML syntax has been modified for this
preprocessor. There are many reasons, and they will become clearer in the
following explanations. For now, you can see the advantage that the syntax is
less verbose, as you need not repeat the element names in end tags.
Furthermore, you can control exactly which characters are part of the data
nodes without having to make compromises with indentation.

Attributes are written as in XML: 

let book = 
  <:pxp_tree< 
    <book id="BOOK_001">
      [ <title lang="en">[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>



An element without children can be written 

<element>[]

or slightly shorter: 

<element/>



You can also create processing instructions and comment nodes: 

let list =
  <:pxp_tree<
    <list>
      [ <!>"Now the list of books follows!"
        <?>"formatter_directive" "one book per page"
        book
      ]
 >>

The notation "<!>" creates a comment node with the following string as
contents. The notation "<?>" needs two strings, first the target, then the
value (here, this results in "<?formatter_directive one book per page?>").

Look again at the last example: The O'Caml variable "book" occurs, and it 
inserts its tree into the list of books. Identifiers without "decoration" just 
refer to O'Caml variables. We will see more examples below.

The preprocessor syntax knows a number of shortcuts and variations. First, you 
can omit the square brackets when an element has exactly one child: 

<element><child>"Data inside child"

This is the same as 

<element>[ <child>[ "Data inside child" ] ]

Second, you are already used to a common abbreviation: Strings are 
automatically converted to data nodes. The "expanded" syntax is 

<*>"Data string"

where "<*>" denotes a data node, and the following string is used as contents. 
Usually, you can omit "<*>". However, there are a few occasions where this 
notation is still useful, see below.

In strings, the usual entity references can be used: "Double quotes: &quot;". 
For a newline character, write &#10;.

The preprocessor knows two operators: "^" concatenates strings, and "@" 
concatenates lists. Examples: 

<element>[ "Word1" ^ "Word2" ]
<element>([ <a/> ] @ [ <b/> ])



Parentheses can be used to clarify precedence. For example: 

<element>(l1 @ l2)

Without the parentheses, the expression could also be parsed as

(<element> l1) @ l2

Parentheses may be used in every expression.

Rarely used, there is also a notation for the "super root" nodes (see the PXP 
manual for their meaning): 

<^>[ <element> ... ]



------------------------------------------------------------------------------
Dynamic XML
------------------------------------------------------------------------------

Let us begin with an example. The task is to convert O'Caml values of type 

type book = 
  { title : string;
    author : string;
    isbn : string;
  }

to XML trees like 

<book id="BOOK_'isbn'">
  <title>'title'</title>
  <author>'author'</author>
</book>

(conventional syntax). When b is the book variable, the solution is 

let book = 
  let title = b.title
  and author = b.author
  and isbn = b.isbn in
  <:pxp_tree<
    <book id=("BOOK_" ^ isbn)>
      [ <title><*>title
        <author><*>author
      ]
  >>

First, we bind the simple O'Caml variables "title", "author", and "isbn". The 
reason is that the preprocessor syntax does not allow expressions like 
"b.title" directly in the XML tree (but see below for a better workaround).

The XML tree contains the O'Caml variables. The "id" attribute is a 
concatenation of the fixed prefix "BOOK_" and the contents of "isbn". The 
"title" and "author" elements contain a data node whose contents are the O'Caml 
strings "title", and "author", respectively.

Why "<*>"? If we just wrote "<title>title", the generated code would assume 
that the "title" variable is an XML node, and not a string. From this point of 
view, "<*>" works like a type annotation, as it specialises the type of the 
following expression.

Here is an alternate solution: 

let book = 
  <:pxp_tree<
    <book id=("BOOK_" ^ (: b.isbn :))>
      [ <title><*>(: b.title :)
        <author><*>(: b.author :)
      ]
  >>

The notation "(: ... :)" allows you to include arbitrary O'Caml expressions
in the tree. In this solution it is no longer necessary to create artificial
O'Caml variables for the sole purpose of injecting values into trees.

It is possible to create XML elements with dynamic names: Just put parentheses 
around the expression. Example: 

let name = "book" in
<:pxp_tree< <(name)> ... >>

With the same notation, one can also set attribute names dynamically: 

let att_name = "id" in
<:pxp_tree< <book (att_name)=...> ... >>

Finally, it is also possible to include complete attribute lists dynamically: 

let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
<:pxp_tree< <book (: att_list :) > ... >>



Typing: Depending on where a variable or O'Caml expression occurs, different 
types are assumed. Compare the following examples: 

<:pxp_tree< <element>x1 >>
<:pxp_tree< <element>[x2] >>
<:pxp_tree< <element><*>x3 >>

As a rule of thumb, the most general type is assumed that would make sense at a
certain location. As x1 could be replaced by a list of children, its type is
assumed to be a node list. As x2 could be replaced by a single node, its type
is assumed to be a node. And x3 is a string, because it follows "<*>"; this is
the case we have already seen.
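
The rule can be pictured with a small toy model in plain O'Caml. The "node"
type and the functions "element" and "data" are assumptions made up for this
illustration and are not PXP's real classes; the sketch only shows which type
each position expects.

```ocaml
(* Toy stand-in for PXP document nodes; not the real PXP types. *)
type node =
  | Element of string * node list
  | Data of string

(* The position directly after a start tag takes the whole child list. *)
let element (children : node list) : node = Element ("element", children)

(* The position after "<*>" takes a string. *)
let data (s : string) : node = Data s

let x3 : string = "text"
let x2 : node = data x3            (* a single node     *)
let x1 : node list = [ x2; x2 ]    (* a list of nodes   *)

let t1 = element x1                (* models <element>x1      *)
let t2 = element [ x2 ]            (* models <element>[x2]    *)
let t3 = element [ data x3 ]       (* models <element><*>x3   *)
```

In this model, swapping x1 and x2 is rejected by the type checker, which is
the kind of error you would also see in the code generated for "pxp_tree".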

------------------------------------------------------------------------------
Character Encodings
------------------------------------------------------------------------------

As the preprocessor generates code that builds XML trees, it must know two 
character encodings:

-  Which encoding is used in the source code (in the .ml file) 
   
-  Which encoding is used in the XML representation, i.e. in the O'Caml values 
   representing the XML trees
   
Both encodings can be set independently. The syntax is: 

<:pxp_charset< source="ENC" representation="ENC" >>

The default is ISO-8859-1 for both encodings. For example, to set the 
representation encoding to UTF-8, use: 

<:pxp_charset< representation="UTF-8" >>

The "pxp_charset" notation is a constant expression that always evaluates to 
"()". (A requirement by camlp4 that looks artificial.) 

When you set the representation encoding, it is required that the encoding 
stored in the DTD object is the same. Remember that we need a DTD object like 

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

Of course, we must change this to the representation encoding, too, in our 
example: 

let dtd = Pxp_dtd.create_dtd `Enc_utf8;;

The preprocessor cannot check this at compile time, and for performance
reasons, a runtime check is not generated. So it is up to the programmer to
ensure that the character encodings are used in a consistent way.
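
To see why the two encodings must be kept apart, recall that the same
character has different byte sequences in ISO-8859-1 and UTF-8. The following
sketch is plain O'Caml and independent of PXP; "latin1_to_utf8" is a
hypothetical helper written only for this illustration.

```ocaml
(* Recode an ISO-8859-1 string to UTF-8. Every Latin-1 byte is a code point
   below 256, so it needs at most two bytes in UTF-8. *)
let latin1_to_utf8 (s : string) : string =
  let b = Buffer.create (2 * String.length s) in
  String.iter
    (fun c ->
       let code = Char.code c in
       if code < 0x80 then
         Buffer.add_char b c
       else begin
         Buffer.add_char b (Char.chr (0xC0 lor (code lsr 6)));
         Buffer.add_char b (Char.chr (0x80 lor (code land 0x3F)))
       end)
    s;
  Buffer.contents b

(* "e" with acute accent: one byte in ISO-8859-1, two bytes in UTF-8. *)
let e_acute_latin1 = "\xE9"
let e_acute_utf8 = latin1_to_utf8 e_acute_latin1
```

A string stored in one encoding but interpreted in the other therefore yields
wrong characters, which is what happens when the DTD object and the
"pxp_charset" declaration disagree.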

------------------------------------------------------------------------------
Validated Trees
------------------------------------------------------------------------------

In order to validate trees, you need a filled DTD object. In principle, you can 
create this object by a number of methods. For example, you can parse an 
external file: 

let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")

It is, however, often more convenient to include the DTD literally into the 
program. This works by 

let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")

As double quotes are often used inside DTDs, O'Caml string literals are a bit
impractical: they are also delimited by double quotes, so one needs to add
backslashes as escape characters. The "pxp_text" notation is often more
readable here: <:pxp_text<STRING>> is just another way of writing "STRING".
For our DTD, we write

let dtd_text =
  <:pxp_text<
    <!ELEMENT book (title,author)>
    <!ATTLIST book id CDATA #REQUIRED>
    <!ELEMENT title (#PCDATA)>
    <!ATTLIST title lang CDATA "en">
    <!ELEMENT author (#PCDATA)>
  >>;;
let config = default_config;;
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;

Note that "pxp_text" is not restricted to DTDs, as it can be used for any kind 
of string.

After we have the DTD, we can validate the trees. One option is to call the 
"validate" function: 

let book = 
  <:pxp_tree< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>;;
Pxp_document.validate book;;

(This example is invalid, as the "id" attribute is missing.)

Note that it is a misunderstanding that "pxp_tree" builds XML trees in
well-formed mode. You can create any tree with it; "pxp_tree" simply does not
invoke the validator. So if the DTD enforces validation, the tree is validated
when the "validate" function is called. If the DTD is in well-formedness mode,
the tree is effectively not validated, even when the "validate" function is
invoked. By the way, the following statements would create a DTD in
well-formedness mode:

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # allow_arbitrary;;

As an alternative to calling the "validate" function, one can also use
"pxp_vtree" instead. It immediately validates every XML element it creates.
However, "injected" subtrees are not validated, i.e. validation does not
proceed recursively to subnodes as the "validate" function does.
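
The difference between the two strategies can be sketched in plain O'Caml. The
record type and the single rule below (every "book" needs an "id" attribute)
are assumptions for this illustration only and stand in for real PXP nodes and
a real DTD.

```ocaml
(* Toy node type; not the PXP node class. *)
type node =
  { name : string;
    atts : (string * string) list;
    children : node list;
  }

(* Toy rule standing in for a DTD: every "book" needs an "id" attribute. *)
let locally_valid n = n.name <> "book" || List.mem_assoc "id" n.atts

(* Recursive check, in the spirit of the "validate" function. *)
let rec validate_deep n =
  locally_valid n && List.for_all validate_deep n.children

(* Check at creation time, in the spirit of "pxp_vtree": only the element
   being created is checked, not the children passed in. *)
let make_checked name atts children =
  let n = { name; atts; children } in
  if not (locally_valid n) then failwith ("invalid element: " ^ name);
  n

(* An invalid subtree built without any check, then injected. *)
let unchecked_book = { name = "book"; atts = []; children = [] }
let list_node = make_checked "list" [] [ unchecked_book ]
let deep_ok = validate_deep list_node
```

Here "make_checked" accepts the list although it contains an invalid book,
whereas the recursive check finds the problem.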

------------------------------------------------------------------------------
Generating Events
------------------------------------------------------------------------------

As PXP also has an event model to represent XML, the preprocessor can produce
such events, too. There are two modes: The "pxp_evlist" notation outputs lists
of events (type "event list") representing the XML expression. The
"pxp_evpull" notation creates an automaton from which one can "pull" events
(like from a pull parser).

These two notations work very much like "pxp_tree". For example, 

let book = 
  <:pxp_evlist< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

generates 

[ E_start_tag ("book", [], None, <obj>);
  E_start_tag ("title", [], None, <obj>);
  E_char_data "The Lord of The Rings"; 
  E_end_tag ("title", <obj>);
  E_start_tag ("author", [], None, <obj>); 
  E_char_data "J.R.R. Tolkien";
  E_end_tag ("author", <obj>); 
  E_end_tag ("book", <obj>)
]

Note that you neither need a "dtd" variable nor a "spec" variable. There is one 
important difference, however: Both nodes and lists of nodes are represented by 
the same type, "event list". That has the consequence that in the following 
example x1 and x2 have the same type "event list": 

<:pxp_evlist< <element>x1 >>
<:pxp_evlist< <element>[x2] >>
<:pxp_evlist< <element><*>x3 >>

In principle, it could be checked at runtime whether x1 and x2 have the right
structure. However, this is not done for performance reasons.
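
A simplified model in plain O'Caml shows why nodes and node lists cannot be
told apart. The "event" type below is a reduced stand-in invented for this
sketch (the real PXP events carry more arguments), and "element" is a
hypothetical helper.

```ocaml
(* Reduced event type; not the PXP event type. *)
type event =
  | Start_tag of string
  | Char_data of string
  | End_tag of string

(* An element wraps the events of its children in a start/end tag pair. *)
let element name (children : event list) : event list =
  (Start_tag name :: children) @ [ End_tag name ]

(* One node and two nodes have the same type, "event list". *)
let x2 : event list = element "a" []
let x1 : event list = element "a" [] @ element "b" []

let e1 = element "element" x1      (* models <element>x1   *)
let e2 = element "element" x2      (* models <element>[x2] *)

(* Nothing stops an ill-structured list from being passed either. *)
let unbalanced = element "element" [ End_tag "oops" ]
```

The last definition type-checks although the resulting event list is not
balanced; only a runtime check could detect this.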

As mentioned, "pxp_evpull" works like a pull parser. After defining 

let book = 
  <:pxp_evpull< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

"book" is a function 'a -> event option. One can call it to get the events one
after the other:

let e1 = book();;       (* = Some(E_start_tag ("book", [], None, <obj>)) *)
let e2 = book();;       (* = Some(E_start_tag ("title", [], None, <obj>)) *)
...

After the last event, "book" returns None to indicate the end of the event 
stream.

As with "pxp_evlist", it is not possible to distinguish between nodes and node
lists. In this example, both x1 and x2 are assumed to have type
'a -> event option:

<:pxp_evpull< <element>x1 >>
<:pxp_evpull< <element>[x2] >>
<:pxp_evpull< <element><*>x3 >>

Note that "<element>x1" actually means to build a new pull automaton around the 
existing pull automaton x1: The children of "element" are retrieved by pulling 
events from x1 until "None" is returned.

A consequence of the pull semantics is that once an event is obtained from an
automaton, the state of the automaton is modified such that it is not possible
to get the same event again. If you need an automaton that can be reset to the
beginning, just wrap the "pxp_evpull" notation into a functional abstraction:

let book_maker() =
  <:pxp_evpull< <book ...> ... >>;;
let book1 = book_maker();;
let book2 = book_maker();;

This way, "book1" and "book2" are independent event streams.
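
The same behaviour can be reproduced in plain O'Caml without PXP. In this
sketch, "pull_of_list" is a hypothetical helper and strings stand in for
events; only the state handling is meant to be representative.

```ocaml
(* Turn a list into a pull function: it returns Some x for each item in
   order, and None once the list is exhausted. *)
let pull_of_list items =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl ->
        rest := tl;
        Some x

(* The maker builds a fresh automaton with its own state on every call. *)
let book_maker () = pull_of_list [ "start book"; "end book" ]

let book1 = book_maker ()
let book2 = book_maker ()

let a = book1 ()    (* first event of book1 *)
let b = book1 ()    (* second event of book1 *)
let c = book1 ()    (* book1 is exhausted now *)
let d = book2 ()    (* book2 still starts at the beginning *)
```

Because each call of "book_maker" allocates a new reference cell, pulling
from "book1" never affects "book2".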

There is another implication of the nature of the automata: Subexpressions
are lazily evaluated. For example, in

<:pxp_evpull< <element>[ <*> (: get_data_contents() :) ] >>

the call of get_data_contents is performed just before the event for the data 
node is constructed.
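
A hand-written pull function in plain O'Caml illustrates this timing. The
sketch does not use PXP: strings stand in for events, "make_stream" is a
hypothetical name, and "get_data_contents" is a stand-in that counts how often
it is called.

```ocaml
let calls = ref 0

(* Stand-in for get_data_contents; records every invocation. *)
let get_data_contents () =
  incr calls;
  "contents"

(* Toy pull function for an element with one data node as child. *)
let make_stream () =
  let state = ref 0 in
  fun () ->
    incr state;
    match !state with
    | 1 -> Some "start element"
    | 2 -> Some ("data " ^ get_data_contents ())   (* evaluated only here *)
    | 3 -> Some "end element"
    | _ -> None

let stream = make_stream ()
let calls_before = !calls
let first = stream ()
let calls_after_first = !calls
let second = stream ()
let calls_after_second = !calls
```

Creating the stream and pulling the start tag do not call
"get_data_contents"; the call happens when the second event is requested.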

------------------------------------------------------------------------------
Namespaces
------------------------------------------------------------------------------

By default, the preprocessor does not generate nodes or events that support 
namespaces. It can, however, be configured to create namespace-aware XML 
aggregations.  

In any case, you need a namespace manager. This is an object that tracks the 
usage of namespace prefixes in XML nodes. For example, we can create a 
namespace manager that knows the "html" prefix: 

let mng = new namespace_manager;;
mng # add_namespace "html" "http://www.w3.org/1999/xhtml";;

Here, we declare that we want to use the "html" prefix for the internal 
representation of the XML nodes. This kind of prefix is called normalized 
prefix, or normprefix for short. It is possible to configure different prefixes 
for the external representation, i.e. when the XML tree is printed to a file. 
This other kind of prefix is called display prefix. We will have a look at them 
later.

Next, we must tell the DTD object that we have a namespace manager: 

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # set_namespace_manager mng;;



For "pxp_evlist" and "pxp_evpull" we are now prepared (note that we now need a
"dtd" variable, as the DTD object knows the namespace manager). For "pxp_tree"
and "pxp_vtree", it is required to use a namespace-aware specification:

let spec = Pxp_tree_parser.default_namespace_spec 

(Normal specifications do not work; you would get "Namespace method not
applicable" errors if you tried to use them.)

The special notation "<:autoscope>" enables namespace mode in this example: 

let list =
  <:pxp_tree<
    <:autoscope>
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>

In particular, "<:autoscope>" defines a new O'Caml variable for its
subexpression: "scope". This variable contains the namespace scope object,
which contains the namespace declarations for the subexpression. "<:autoscope>"
initialises this variable from the namespace manager such that it now contains
a declaration for the "html" prefix.

In general, the namespace scope object contains the prefixes to use for the 
external representation. For this simple example, we have chosen to use the 
same prefixes as for the internal representation, and "<:autoscope>" performs 
the right initialisations for this.

Print the tree by 

list # display (`Out_channel stdout) `Enc_iso88591

The point is to call the "display" method and not the "write" method. The 
latter would not respect the display prefixes.  

Alternatively, we can also create the "scope" variable manually: 

let scope = Pxp_dtd.create_namespace_scope
              ~decl:[ "", "http://www.w3.org/1999/xhtml" ]
              mng;;
let list =
  <:pxp_tree<
    <:scope>
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>

Note that we now use "<:scope>". In this simple form, this construct just 
enables namespace mode, and takes the "scope" variable from the environment.

Furthermore, the namespace scope now contains a different namespace
declaration: The display prefix "" is used for HTML. The empty prefix just
means to declare a default namespace (by xmlns="URI"). The effect can be seen
when the XML tree is printed by calling the "display" method.

Here is a third variant of the same example: 

let scope = Pxp_dtd.create_namespace_scope mng ;;
let list =
  <:pxp_tree<
    <:scope ("")="http://www.w3.org/1999/xhtml">
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>

The "scope" is now initially empty. The "<:scope>" notation is used to extend 
the scope for the time the subexpression is evaluated.

There is also a notation "<:emptyscope>" that creates an empty scope object, so
one could even write

let list =
  <:pxp_tree<
    <:emptyscope>
      <:scope ("")="http://www.w3.org/1999/xhtml">
        <html:ul>
          [ <html:li>"Item1"
            <html:li>"Item2"
          ]
  >>



It is recommended to create the "scope" variable manually with a reasonable 
initial declaration, and to use "<:scope>" to enable namespace processing, and 
to extend the scope when necessary. The advantage of this approach is that the 
same scope object can be shared by many XML nodes, so you need less memory.

One tip: To get a namespace scope that is initialised with all prefixes of the
namespace manager (as <:autoscope> does), define

let scope = create_namespace_scope ~decl: mng#as_declaration mng



For event-based processing of XML, the namespace mode works in the same way as
described here; there is no difference.
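
The relationship between normalized prefixes and display prefixes can be
pictured with a toy model in plain O'Caml. Association lists stand in for the
namespace manager and the scope object, and "display_name" is a hypothetical
function; none of this is PXP's actual implementation.

```ocaml
let xhtml = "http://www.w3.org/1999/xhtml"

(* Toy namespace manager: normalized prefix -> URI. *)
let manager = [ "html", xhtml ]

(* Toy scopes: display prefix -> URI. *)
let auto_scope = manager            (* same prefixes as the manager      *)
let manual_scope = [ "", xhtml ]    (* empty prefix = default namespace  *)

(* Map an internal name such as "html:ul" to the name used for output. *)
let display_name manager scope name =
  match String.index_opt name ':' with
  | None -> name
  | Some i ->
      let normprefix = String.sub name 0 i in
      let local = String.sub name (i + 1) (String.length name - i - 1) in
      let uri = List.assoc normprefix manager in
      (* Raises Not_found if the scope declares no prefix for the URI. *)
      let display_prefix, _ = List.find (fun (_, u) -> u = uri) scope in
      if display_prefix = "" then local else display_prefix ^ ":" ^ local

let with_auto = display_name manager auto_scope "html:ul"
let with_manual = display_name manager manual_scope "html:ul"
```

In this model the internal name is always "html:ul"; only the scope decides
whether the output uses the "html" prefix or the default namespace.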


This web site is published by Informatikbüro Gerd Stolpmann