README - PXP, the XML parser for O'Caml


PXP is a validating parser for XML-1.0 which has been written entirely in 
Objective Caml. 

PXP stands for "Polymorphic XML Parser". The name emphasizes its most useful 
property: the API is polymorphic and can be configured such that different 
objects are used to store different types of elements.


You can download PXP as a gzip'ed tarball [1]. The parser needs the 
Netstring [2] package (0.9.3). Note that PXP requires O'Caml 3.00. 

Information about the latest development version is available here [3]. 

User's Manual

The manual is included in the distribution both as a PostScript document and 
as a set of HTML files. An online version can be found here [4]. 

Author, Credits, Copying

PXP has been written by Gerd Stolpmann [5]; it contains contributions by 
Claudio Sacerdoti Coen. You may copy it as you like, and you may even use it 
for commercial purposes, as long as the license conditions are respected; see 
the file LICENSE that comes with the distribution. It allows almost everything. 

Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.


PXP is a validating XML parser for O'Caml [6]. It strictly complies with the 
XML-1.0 [7] standard. 

The parser is simple to call: usually one statement (a function call) is 
sufficient to parse an XML document and to represent it as an object tree.

Once the document is parsed, it can be accessed using a class interface. The 
interface allows arbitrary access including transformations. One of the 
features of the document representation is its polymorphic nature; it is simple 
to add custom methods to the document classes. Furthermore, the parser can be 
configured such that different XML elements are represented by objects created 
from different classes. This is a very powerful feature, because it simplifies 
the structure of programs processing XML documents. 
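
The idea of representing different elements by objects of different classes 
can be illustrated with a small, self-contained sketch. This is not PXP's 
actual API: the classes node, generic_node, title_node and para_node, the 
table spec and the function make_node are hypothetical names chosen for the 
illustration only. The sketch merely shows the principle of looking up the 
class by element name and falling back to a generic class.

```ocaml
(* Hypothetical sketch; none of these names belong to PXP. *)

(* Common interface shared by all element objects. *)
class virtual node (name : string) = object
  method name = name
  method virtual to_text : string -> string
end

(* Fallback class for elements without a special class. *)
class generic_node name = object
  inherit node name
  method to_text content = content
end

(* Titles are rendered in upper case. *)
class title_node name = object
  inherit node name
  method to_text content = String.uppercase_ascii content
end

(* Paragraphs are rendered with an indentation. *)
class para_node name = object
  inherit node name
  method to_text content = "  " ^ content
end

(* The "specification": element name -> function creating the object. *)
let spec : (string * (string -> node)) list =
  [ "title", (fun n -> (new title_node n :> node));
    "p",     (fun n -> (new para_node n :> node)) ]

(* Select the class by element name; unknown names get the generic class. *)
let make_node name : node =
  match List.assoc_opt name spec with
  | Some create -> create name
  | None -> (new generic_node name :> node)

let () =
  List.iter
    (fun (name, content) ->
       print_endline ((make_node name)#to_text content))
    [ "title", "pxp"; "p", "A validating parser."; "note", "misc" ]
```

Because every object answers the same method, the processing code does not 
need a case distinction on element names.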

Note that the class interface does not comply with the DOM standard. It was 
not a development goal to realize a standard API (industrial developers can do 
this much better than I); however, the API is powerful enough to be considered 
equivalent to DOM. More importantly, the interface is compatible with the XML 
information model required by many XML-related standards. 

Detailed feature list

-  The XML instance is validated against the DTD; any violation of a validation 
   constraint leads to the rejection of the instance. The validator has been 
   carefully implemented, and conforms strictly to the standard. If needed, it 
   is also possible to run the parser in a well-formedness mode.
-  If possible, the validator applies a deterministic finite automaton to 
   validate the content models. In that case, validation is always performed 
   in linear time. If the content models are not deterministic, however, the 
   parser uses a backtracking algorithm, which can be much slower. It is also 
   possible to reject non-deterministic content models.
-  In particular, the validator also checks the complicated rules for whether 
   parentheses are properly nested with respect to entities, and whether the 
   standalone declaration is satisfied. On demand, it also checks whether the 
   IDREF attributes only refer to existing nodes.
-  Entity references are automatically resolved while the XML text is being 
   scanned. It is not possible to recognize in the object tree where a 
   referenced entity begins or ends; the object tree only represents the 
   logical structure.
-  External entities are loaded using a configurable resolver infrastructure. 
   It is possible to connect the parser with an arbitrary XML source.
-  The parser can read XML text encoded in a variety of character sets. 
   Independent of this, it is possible to choose the encoding of the internal 
   representation of the tree nodes; the parser automatically converts the 
   input text to this encoding. Currently, the parser supports UTF-8 and 
   ISO-8859-1 as internal encodings.
-  The interface of the parser has been designed to integrate well with the 
   O'Caml language. The first goal was simplicity of usage 
   which is achieved by many convenience methods and functions, and by allowing 
   the user to select which parts of the XML text are actually represented in 
   the tree. For example, it is possible to store processing instructions as 
   tree nodes, but the parser can also be configured such that these 
   instructions are put into hashtables. The information model is compatible 
   with the requirements of XML-related standards such as XPath.
-  In particular, the node tree can optionally contain or leave out processing 
   instructions and comments. It is also possible to generate a "super root" 
   object which is the parent of the root element. The attributes of elements 
   are normally not stored as nodes, but it is possible to get them wrapped 
   into nodes.
-  The powerful type system of O'Caml allows the parser to support 
   polymorphism based on the element types, i.e. the parser can be configured 
   to select different classes to represent different element types. Note 
   that no generator is needed for this feature. 
-  There is also an interface for DTDs; you can parse and access sequences of 
   declarations. The declarations are fully represented as recursive O'Caml 
-  Since PXP 1.1, the parser supports namespaces. This has been implemented 
   using a technique called "prefix normalization", i.e. while parsing, the 
   namespace prefixes are changed (in a configurable way) such that they become 
   unique in the whole document (or document domain). This is again a solution 
   that is different from other parsers, but it allows a very convenient style 
   of processing namespaces while sticking strictly to the XML standard. 
   Another advantage of this solution is that DTDs can refer to namespaces in a 
   transparent way, i.e. it is possible to validate a document against a DTD 
   that uses different namespace prefixes for the same namespaces.
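
The linear-time validation of deterministic content models mentioned in the 
feature list can be sketched as follows. This is an illustration only and not 
PXP's internal representation: the content model (title, para*, footer), the 
state names and the functions step and validate are assumptions made for the 
example. The automaton reads each child element exactly once, so the running 
time grows linearly with the number of children.

```ocaml
(* Illustrative only: a hand-built DFA for the hypothetical content model
   (title, para*, footer). *)

type state = Start | After_title | Done

(* Transition function; None means that no transition exists. *)
let step state child =
  match state, child with
  | Start, "title" -> Some After_title
  | After_title, "para" -> Some After_title
  | After_title, "footer" -> Some Done
  | _ -> None

(* A single pass over the children; accepts only if the final state is
   reached exactly when the input is exhausted. *)
let validate children =
  let rec run state = function
    | [] -> state = Done
    | child :: rest ->
      (match step state child with
       | Some next -> run next rest
       | None -> false)
  in
  run Start children

let () =
  Printf.printf "%b\n" (validate [ "title"; "para"; "para"; "footer" ]);
  Printf.printf "%b\n" (validate [ "para"; "footer" ])
```
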
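
The principle of prefix normalization can be sketched in a few lines. The 
code below is a simplified illustration and not taken from PXP: the table 
normalized and the functions split_name and normalize are hypothetical, the 
default namespace is ignored, and the sketch simply fails for namespaces that 
have no configured prefix. It shows how two documents using different 
prefixes for the same namespace end up with identical element names.

```ocaml
(* Hypothetical sketch of prefix normalization; not PXP's code. *)

(* Prefixes chosen by the application: namespace URI -> normalized prefix. *)
let normalized =
  [ "http://www.w3.org/1999/xhtml", "html";
    "http://www.w3.org/2000/svg", "svg" ]

(* Split "prefix:local" into (Some prefix, local); names without a colon
   yield (None, name). *)
let split_name name =
  match String.index_opt name ':' with
  | Some i ->
    (Some (String.sub name 0 i),
     String.sub name (i + 1) (String.length name - i - 1))
  | None -> (None, name)

(* in_scope maps the prefixes declared in the document to namespace URIs.
   The document prefix is replaced by the normalized prefix of its URI. *)
let normalize in_scope name =
  match split_name name with
  | None, _ -> name  (* unprefixed names are left alone in this sketch *)
  | Some prefix, local ->
    (match List.assoc_opt prefix in_scope with
     | None -> failwith ("undeclared namespace prefix: " ^ prefix)
     | Some uri ->
       (match List.assoc_opt uri normalized with
        | Some norm -> norm ^ ":" ^ local
        | None -> failwith ("no normalized prefix for: " ^ uri)))

let () =
  (* Two documents use different prefixes for the XHTML namespace. *)
  let doc1 = [ "h", "http://www.w3.org/1999/xhtml" ] in
  let doc2 = [ "x", "http://www.w3.org/1999/xhtml" ] in
  print_endline (normalize doc1 "h:body");
  print_endline (normalize doc2 "x:body")
```

After normalization, application code and DTDs only need to know the 
normalized prefixes, regardless of the prefixes used in a particular document.
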

Code examples

This distribution contains several examples:

-  validate: simply parses a document and prints all error messages 
-  readme: Defines a DTD for simple "README"-like documents, and offers 
   conversion to HTML and text files [8]. 
-  xmlforms: This is already a sophisticated application that uses XML as style 
   sheet language and data storage format. It shows how a Tk user interface can 
   be configured by an XML style, and how data records can be stored using XML. 

Restrictions and missing features

The following restrictions apply that are not violations of the standard: 

-  The attribute "xml:lang" is not supported specially.  (The application can 
   do this.)
-  It is currently not possible to check for interoperability with SGML. 

The following features are also missing:

-  There is no support for XPath or XSLT.

However, I hope that these features will be implemented soon, either by myself 
or by contributors (who are invited to do so).

Release notes

The Release Notes [9] describe problems and solutions that were detected 
shortly before or after the release.

Recent Changes

-  Changed in 1.1.6:
   Updates for O'Caml 3.07
-  Changed in 1.1.5:
   A packaging error in pxp-wlex has been corrected. (This is the only change.)
-  Changed in 1.1.4:
   This is a bigger bugfix release that addresses the following problems:
   The parser no longer shows very bad performance when large data nodes 
   without linefeeds are parsed.
   Another performance problem with bigger DTDs has been solved, too. 
   In particular, the XHTML DTD can now be parsed quite quickly.
   The interface Pxp_dtd.Entity has been extended; it is now possible to 
   access more properties of entities than before.
   Pxp_marshal has been revised. It is now possible to recode the input or 
   output stream on the fly in order to change the character encoding. 
   Furthermore, the functions relocate_subtree and relocate_document allow one 
   to marshal an XML tree or a document, and to read the marshaled data 
   immediately to create a copy of the original structure.
   Some errors have been fixed in from_file. In particular, this function will 
   no longer raise the exception Malformed_URL if the current working 
   directory happens to be "/".
   Pxp_document.strip_whitespace now implements xml:space correctly. In 
   previous versions, xml:space='default' was ignored when it occurred inside 
   an element with xml:space='preserve'. Now the inner xml:space='default' 
   overrides the outer xml:space='preserve' as defined in the XML standard.
-  Changed in 1.1.3:
   This release fixes a single problem occurring when PXP is compiled against 
   an installed netstring-0.10. (There is no problem with netstring-0.91.) 
-  Changed in 1.1.2:
   Improved write method for whole documents. It can now also output a 
   reference to an external DTD.
   PXP can be compiled with O'Caml 3.04.
-  Changed in 1.1.1:
   Minor changes for O'Caml 3.03-alpha. The interfaces have not been modified.
-  Changed in 1.1:
   The parser now supports namespaces.
   Extended/updated Pxp_document interface. There is now a separate class for 
   every node type. It is now clear which node methods validate and which do 
   not validate. The node tree can now be modified more easily 
   (insert/delete). It is now possible to start in well-formedness mode and 
   validate the XML tree later (as a whole, or partially).
   New functions for tree normalization and whitespace stripping.
   The implementation of Pxp_document has been updated, too. There are now 
   many virtual classes, one class for one task. The attribute representation 
   has been improved. The overall size of the document tree has been reduced.
   The parser is better at counting lines. The option errors_with_line_numbers 
   could be removed because the parser is now fast enough that it makes 
   sense to always count lines.
   There are now string pools that can save memory in some situations.
   New module Pxp_marshal allows marshalling of XML trees over channels 
   (faster than writing the tree and reparsing it).
   For the most important entity functions there is an interface 
   Although there are many extensions, the parser has been speeded up.
   The parser has been divided up into several packages, and the directory 
   structure of the distribution has been cleaned up.
   It is possible to choose among several lexical analyzers. One of them is 
   based on Alain Frisch's wlex patch, which reduces the size of executables 
   if a UTF-8 parser is needed.
   The parser works under Cygwin.
   Several bugfixes, of course. Note that most of these bugs were introduced 
   in the development cycle between 1.0 and 1.1; only very few problems have 
   been detected in the 1.0 release. I hope that 1.1 has similar quality.


[1]   see

[2]   see


[4]   see

[5]   see

[6]   see

[7]   see

[8]   This particular document is an example of this DTD!


This web site is published by Informatikbüro Gerd Stolpmann