******************************************************************************
The Preprocessor for PXP
******************************************************************************

==============================================================================
The Preprocessor for PXP
==============================================================================

Since PXP-1.1.95, there is a preprocessor as part of the PXP distribution. It
allows you to compose XML trees and event lists dynamically, which is very
handy to write XML transformations.

To enable the preprocessor, compile your source files as in:

  ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...

The package pxp-pp contains the preprocessor. The -syntax option enables
camlp4, on which the preprocessor is based. It is also possible to use it
together with the revised syntax, use "-syntax camlp4r" in this case.

Important: Up to version 1.0.4, findlib (ocamlfind) has a problem with the
definition for pxp-pp. There is an easy workaround: Use
"-syntax camlp4o,byte".

In the toploop, type

  ocaml
  # #use "topfind";;
  # #camlp4o;;
  # #require "pxp-pp";;
  # #require "pxp";;

The preprocessor defines the following new syntax notations, explained below
in detail:

  <:pxp_charset< CHARSET_DECL >>
  <:pxp_tree< EXPR >>
  <:pxp_vtree< EXPR >>
  <:pxp_evlist< EXPR >>
  <:pxp_evpull< EXPR >>
  <:pxp_text< TEXT >>

The basic notation is "pxp_tree" which creates a tree of PXP document nodes as
described in EXPR. "pxp_vtree" is the variant where the tree is immediately
validated. "pxp_evlist" creates a list of PXP events instead of nodes, useful
together with the event-based parser. "pxp_evpull" is a variation of the
latter: Instead of an event list an event generator is created that works like
a pull parser.

The "pxp_charset" notation only configures the character sets to assume.
Finally, "pxp_text" is a notation for string literals.

------------------------------------------------------------------------------
Creating constant XML
------------------------------------------------------------------------------

The following examples are all written for "pxp_tree". You can also use one of
the other XML composers instead, but see the notes below.

In order to use "pxp_tree", you must define two variables in the environment:
"spec" and "dtd":

  let spec = Pxp_tree_parser.default_spec;;
  let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

These variables occur in the code generated by the preprocessor. The "dtd"
variable is the DTD object. Note that you need it even in well-formedness mode
(validation turned off). The "spec" variable controls which classes are
instantiated as node representation (see the PXP manual).

Now you can create XML trees like in

  let book =
    <:pxp_tree<
      <book>
        [ <title>[ "The Lord of The Rings" ]
          <author>[ "J.R.R. Tolkien" ]
        ]
    >>

As you can see, the syntax is somewhat XML-related but not really XML. (Many
ideas are borrowed from CDUCE, by the way.) In particular, there are start
tags like <title> but no end tags. Instead, we are using square brackets to
denote the children of an XML element. Furthermore, character data must be put
into double quotes.

You may ask why the well-known XML syntax has been modified for this
preprocessor. There are many reasons, and they will become clearer in the
following explanations. For now, you can see the advantage that the syntax is
less verbose, as you need not repeat the element names in end tags.
Furthermore, you can control exactly which characters are part of the data
nodes without having to make compromises with indentation.

Attributes are written as in XML:

  let book =
    <:pxp_tree<
      <book id="BOOK_001">
        [ <title lang="en">[ "The Lord of The Rings" ]
          <author>[ "J.R.R.
Tolkien" ]
        ]
    >>

An element without children can be written

  <element>[]

or slightly shorter:

  <element/>

You can also create processing instructions and comment nodes:

  let list =
    <:pxp_tree<
      <list>
        [ <!>"Now the list of books follows!"
          <?>"formatter_directive" "one book per page"
          book
        ]
    >>

The notation "<!>" creates a comment node with the following string as
contents. The notation "<?>" needs two strings, first the target, then the
value (here, this results in "<?formatter_directive one book per page?>").

Look again at the last example: The O'Caml variable "book" occurs, and it
inserts its tree into the list of books. Identifiers without "decoration" just
refer to O'Caml variables. We will see more examples below.

The preprocessor syntax knows a number of shortcuts and variations. First, you
can omit the square brackets when an element has exactly one child:

  <element><child>"Data inside child"

This is the same as

  <element>[ <child>[ "Data inside child" ] ]

Second, you are already used to a common abbreviation: Strings are
automatically converted to data nodes. The "expanded" syntax is

  <*>"Data string"

where "<*>" denotes a data node, and the following string is used as contents.
Usually, you can omit "<*>". However, there are a few occasions where this
notation is still useful, see below.

In strings, the usual entity references can be used: "Double quotes: &quot;".
For a newline character, write &#10;.

The preprocessor knows two operators: "^" concatenates strings, and "@"
concatenates lists. Examples:

  <element>[ "Word1" ^ "Word2" ]
  <element>([ <a/> ] @ [ <b/> ])

Parentheses can be used to clarify precedence. For example:

  <element>(l1 @ l2)

Without the parentheses, the expression could also be parsed as

  (<element> l1) @ l2

Parentheses may be used in every expression.

Rarely used, there is also a notation for the "super root" nodes (see the PXP
manual for their meaning):

  <^>[ <element> ...
  ]

------------------------------------------------------------------------------
Dynamic XML
------------------------------------------------------------------------------

Let us begin with an example. The task is to convert O'Caml values of type

  type book =
    { title : string;
      author : string;
      isbn : string;
    }

to XML trees like

  <book id="BOOK_'isbn'">
    <title>'title'</title>
    <author>'author'</author>
  </book>

(conventional syntax). When b is the book variable, the solution is

  let book =
    let title = b.title
    and author = b.author
    and isbn = b.isbn in
    <:pxp_tree<
      <book id=("BOOK_" ^ isbn)>
        [ <title><*>title
          <author><*>author
        ]
    >>

First, we bind the simple O'Caml variables "title", "author", and "isbn". The
reason is that the preprocessor syntax does not allow expressions like
"b.title" directly in the XML tree (but see below for a better workaround).

The XML tree contains the O'Caml variables. The "id" attribute is a
concatenation of the fixed prefix "BOOK_" and the contents of "isbn". The
"title" and "author" elements contain a data node whose contents are the
O'Caml strings "title" and "author", respectively.

Why "<*>"? If we just wrote "<title>title", the generated code would assume
that the "title" variable is an XML node, and not a string. From this point of
view, "<*>" works like a type annotation, as it specialises the type of the
following expression.

Here is an alternate solution:

  let book =
    <:pxp_tree<
      <book id=("BOOK_" ^ (: b.isbn :))>
        [ <title><*>(: b.title :)
          <author><*>(: b.author :)
        ]
    >>

The notation "(: ... :)" allows you to include arbitrary O'Caml expressions
into the tree. In this solution it is no longer necessary to create artificial
O'Caml variables for the sole purpose of injecting values into trees.

It is possible to create XML elements with dynamic names: Just put parentheses
around the expression. Example:

  let name = "book" in
  <:pxp_tree< <(name)> ...
  >>

With the same notation, one can also set attribute names dynamically:

  let att_name = "id" in
  <:pxp_tree< <book (att_name)=...> ... >>

Finally, it is also possible to include complete attribute lists dynamically:

  let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
  <:pxp_tree< <book (: att_list :) > ... >>

Typing: Depending on where a variable or O'Caml expression occurs, different
types are assumed. Compare the following examples:

  <:pxp_tree< <element>x1 >>
  <:pxp_tree< <element>[x2] >>
  <:pxp_tree< <element><*>x3 >>

As a rule of thumb, the most general type is assumed that would make sense at
a certain location. As x1 could be replaced by a list of children, its type is
assumed to be a node list. As x2 could be replaced by a single node, its type
is assumed to be a node. And x3 is a string, we had this case already.

------------------------------------------------------------------------------
Character Encodings
------------------------------------------------------------------------------

As the preprocessor generates code that builds XML trees, it must know two
character encodings:

- Which encoding is used in the source code (in the .ml file)

- Which encoding is used in the XML representation, i.e. in the O'Caml values
  representing the XML trees

Both encodings can be set independently. The syntax is:

  <:pxp_charset< source="ENC" representation="ENC" >>

The default is ISO-8859-1 for both encodings. For example, to set the
representation encoding to UTF-8, use:

  <:pxp_charset< representation="UTF-8" >>

The "pxp_charset" notation is a constant expression that always evaluates to
"()". (A requirement by camlp4 that looks artificial.)

When you set the representation encoding, it is required that the encoding
stored in the DTD object is the same.
Remember that we need a DTD object like

  let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

Of course, we must change this to the representation encoding, too, in our
example:

  let dtd = Pxp_dtd.create_dtd `Enc_utf8;;

The preprocessor cannot check this at compile time, and for performance
reasons, a runtime check is not generated. So it is up to the programmer to
ensure that the character encodings are used in a consistent way.

------------------------------------------------------------------------------
Validated Trees
------------------------------------------------------------------------------

In order to validate trees, you need a filled DTD object. In principle, you
can create this object by a number of methods. For example, you can parse an
external file:

  let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")

It is, however, often more convenient to include the DTD literally in the
program. This works by

  let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")

As double quotes are often used inside DTDs, O'Caml string literals are a bit
impractical: they are also delimited by double quotes, so one needs to add
backslashes as escape characters. The "pxp_text" notation is often more
readable here:

  <:pxp_text<STRING>>

is just another way of writing "STRING". In our DTD, we have

  let dtd_text =
    <:pxp_text<
      <!ELEMENT book (title,author)>
      <!ATTLIST book id CDATA #REQUIRED>
      <!ELEMENT title (#PCDATA)>
      <!ATTLIST title lang CDATA "en">
      <!ELEMENT author (#PCDATA)>
    >>;;
  let config = default_config;;
  let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;

Note that "pxp_text" is not restricted to DTDs, as it can be used for any kind
of string.

After we have the DTD, we can validate the trees. One option is to call the
"validate" function:

  let book =
    <:pxp_tree<
      <book>
        [ <title>[ "The Lord of The Rings" ]
          <author>[ "J.R.R. Tolkien" ]
        ]
    >>;;
  Pxp_document.validate book;;

(This example is invalid, as the "id" attribute is missing.)

Note that it is a misunderstanding that "pxp_tree" builds XML trees in
well-formedness mode. You can create any tree with it; "pxp_tree" simply does
not invoke the validator. So if the DTD enforces validation, the tree is
validated when the "validate" function is called. If the DTD is in
well-formedness mode, the tree is effectively not validated, even when the
"validate" function is invoked.

By the way, the following statements would create a DTD in well-formedness
mode:

  let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
  dtd # allow_arbitrary;

As an alternative to calling the "validate" function, one can also use
"pxp_vtree" instead. It immediately validates every XML element it creates.
However, "injected" subtrees are not validated, i.e. validation does not
proceed recursively to subnodes as the "validate" function does.

------------------------------------------------------------------------------
Generating Events
------------------------------------------------------------------------------

As PXP also has an event model to represent XML, the preprocessor can also
produce such events. In particular, there are two modes: The "pxp_evlist"
notation outputs lists of events (type "event list") representing the XML
expression. The "pxp_evpull" notation creates an automaton from which one can
"pull" events (like from a pull parser).

These two notations work very much like "pxp_tree". For example,

  let book =
    <:pxp_evlist<
      <book>
        [ <title>[ "The Lord of The Rings" ]
          <author>[ "J.R.R. Tolkien" ]
        ]
    >>

generates

  [ E_start_tag ("book", [], None, <obj>);
    E_start_tag ("title", [], None, <obj>);
    E_char_data "The Lord of The Rings";
    E_end_tag ("title", <obj>);
    E_start_tag ("author", [], None, <obj>);
    E_char_data "J.R.R. Tolkien";
    E_end_tag ("author", <obj>);
    E_end_tag ("book", <obj>)
  ]

Note that you need neither a "dtd" variable nor a "spec" variable.

There is one important difference, however: Both nodes and lists of nodes are
represented by the same type, "event list".
That has the consequence that in the following example x1 and x2 have the same
type "event list":

  <:pxp_evlist< <element>x1 >>
  <:pxp_evlist< <element>[x2] >>
  <:pxp_evlist< <element><*>x3 >>

In principle, it could be checked at runtime whether x1 and x2 have the right
structure. However, this is not done for performance reasons.

As mentioned, "pxp_evpull" works like a pull parser. After defining

  let book =
    <:pxp_evpull<
      <book>
        [ <title>[ "The Lord of The Rings" ]
          <author>[ "J.R.R. Tolkien" ]
        ]
    >>

"book" is a function 'a -> event option. One can call it to get the events
one after the other:

  let e1 = book();;  (* = Some(E_start_tag ("book", [], None, <obj>)) *)
  let e2 = book();;  (* = Some(E_start_tag ("title", [], None, <obj>)) *)
  ...

After the last event, "book" returns None to indicate the end of the event
stream.

As for "pxp_evlist", it is not possible to distinguish between nodes and node
lists. In this example, both x1 and x2 are assumed to have type
'a -> event option:

  <:pxp_evpull< <element>x1 >>
  <:pxp_evpull< <element>[x2] >>
  <:pxp_evpull< <element><*>x3 >>

Note that "<element>x1" actually means to build a new pull automaton around
the existing pull automaton x1: The children of "element" are retrieved by
pulling events from x1 until "None" is returned.

A consequence of the pull semantics is that once an event is obtained from an
automaton, the state of the automaton is modified such that it is not possible
to get the same event again. If you need an automaton that can be reset to the
beginning, just wrap the "pxp_evpull" notation into a functional abstraction:

  let book_maker() =
    <:pxp_evpull< <book ...> ... >>;;
  let book1 = book_maker();;
  let book2 = book_maker();;

This way, "book1" and "book2" are independent event streams.

There is another implication of the nature of the automatons: Subexpressions
are lazily evaluated.
For example, in

  <:pxp_evpull<
    <element>[ <*> (: get_data_contents() :) ]
  >>

the call of get_data_contents is performed just before the event for the data
node is constructed.

------------------------------------------------------------------------------
Namespaces
------------------------------------------------------------------------------

By default, the preprocessor does not generate nodes or events that support
namespaces. It can, however, be configured to create namespace-aware XML
aggregations.

In any case, you need a namespace manager. This is an object that tracks the
usage of namespace prefixes in XML nodes. For example, we can create a
namespace manager that knows the "html" prefix:

  let mng = new namespace_manager in
  mng # add_namespace "html" "http://www.w3.org/1999/xhtml"

Here, we declare that we want to use the "html" prefix for the internal
representation of the XML nodes. This kind of prefix is called normalized
prefix, or normprefix for short. It is possible to configure different
prefixes for the external representation, i.e. when the XML tree is printed to
a file. This other kind of prefix is called display prefix. We will have a
look at them later.

Next, we must tell the DTD object that we have a namespace manager:

  let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
  dtd # set_namespace_manager mng;;

For "pxp_evlist" and "pxp_evpull" we are now prepared (note that we need now a
"dtd" variable, as the DTD object knows the namespace manager). For "pxp_tree"
and "pxp_vtree", it is required to use a namespace-aware specification:

  let spec = Pxp_tree_parser.default_namespace_spec

(Normal specifications do not work, you would get "Namespace method not
applicable" errors if you tried to use them.)

The special notation "<:autoscope>" enables namespace mode in this example:

  let list =
    <:pxp_tree<
      <:autoscope>
        <html:ul>
          [ <html:li>"Item1"
            <html:li>"Item2"
          ]
    >>

In particular, "<:autoscope>" defines a new O'Caml variable for its
subexpression: "scope".
This variable contains the namespace scope object, which contains the
namespace declarations for the subexpression. "<:autoscope>" initialises this
variable from the namespace manager such that it contains now a declaration
for the "html" prefix. In general, the namespace scope object contains the
prefixes to use for the external representation. For this simple example, we
have chosen to use the same prefixes as for the internal representation, and
"<:autoscope>" performs the right initialisations for this.

Print the tree by

  list # display (`Out_channel stdout) `Enc_iso88591

The point is to call the "display" method and not the "write" method. The
latter would not respect the display prefixes.

Alternatively, we can also create the "scope" variable manually:

  let scope =
    Pxp_dtd.create_namespace_scope
      ~decl:[ "", "http://www.w3.org/1999/xhtml" ]
      mng;;
  let list =
    <:pxp_tree<
      <:scope>
        <html:ul>
          [ <html:li>"Item1"
            <html:li>"Item2"
          ]
    >>

Note that we now use "<:scope>". In this simple form, this construct just
enables namespace mode, and takes the "scope" variable from the environment.

Furthermore, the namespace scope contains now a different namespace
declaration: The display prefix "" is used for HTML. The empty prefix just
means to declare a default prefix (by xmlns="URI"). The effect can be seen
when the XML tree is printed by calling the "display" method.

Here is a third variant of the same example:

  let scope = Pxp_dtd.create_namespace_scope mng ;;
  let list =
    <:pxp_tree<
      <:scope ("")="http://www.w3.org/1999/xhtml">
        <html:ul>
          [ <html:li>"Item1"
            <html:li>"Item2"
          ]
    >>

The "scope" is now initially empty. The "<:scope>" notation is used to extend
the scope for the time the subexpression is evaluated.

There is also a notation "<:emptyscope>" that creates an empty scope object,
so one could even write

  let list =
    <:pxp_tree<
      <:emptyscope>
        <:scope ("")="http://www.w3.org/1999/xhtml">
          <html:ul>
            [ <html:li>"Item1"
              <html:li>"Item2"
            ]
    >>

It is recommended to create the "scope" variable manually with a reasonable
initial declaration, to use "<:scope>" to enable namespace processing, and to
extend the scope when necessary. The advantage of this approach is that the
same scope object can be shared by many XML nodes, so you need less memory.

One tip: To get a namespace scope that is initialised with all prefixes of the
namespace manager (as "<:autoscope>" does), define

  let scope = create_namespace_scope ~decl: mng#as_declaration mng

For event-based processing of XML, the namespace mode works in the same way as
described here; there is no difference.
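
As a closing illustration, the pull semantics described under "Generating
Events" can be modelled in plain O'Caml without PXP or camlp4. This is only a
sketch under stated assumptions: the "event" type below is a simplified
stand-in for the PXP event type, "pull_of_list" is a hypothetical helper, and
nothing here claims to be the code the preprocessor actually generates. It
shows the three properties mentioned in the text: pulling consumes state, a
maker function yields independent streams, and subexpressions are evaluated
only when their event is pulled.

```ocaml
(* Simplified stand-in for the PXP event type (assumption, not the real one). *)
type event =
  | E_start_tag of string
  | E_char_data of string
  | E_end_tag of string

(* Hypothetical helper: turn a list of delayed events into a pull function.
   Each call returns the next event, or None once the list is exhausted. *)
let pull_of_list (items : (unit -> event) list) : unit -> event option =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | make :: tl ->
        rest := tl;          (* the automaton advances; no way back *)
        Some (make ())       (* the event is computed only now *)

(* Counts how often the data source is called, to observe laziness. *)
let calls = ref 0

let get_data_contents () =
  incr calls;
  "J.R.R. Tolkien"

(* Wrapping the construction in a function gives a fresh stream per call. *)
let book_maker () =
  pull_of_list
    [ (fun () -> E_start_tag "author");
      (fun () -> E_char_data (get_data_contents ()));
      (fun () -> E_end_tag "author") ]

let book1 = book_maker ()
let book2 = book_maker ()

let first1 = book1 ()        (* advances book1 only *)
let calls_before = !calls    (* the data source has not been called yet *)
let second1 = book1 ()       (* the data event is built at this point *)
let calls_after = !calls
let first2 = book2 ()        (* book2 still starts at the beginning *)
```

Representing each pending event as a function of type unit -> event is what
delays the evaluation here; a plain "event list" would compute every event
up front, as "pxp_evlist" does.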