
******************************************************************************
README - PXP, the XML parser for O'Caml
******************************************************************************


==============================================================================
Abstract
==============================================================================

PXP is an XML parser for O'Caml. It represents the parsed document either as a 
tree or as a stream of events. In tree mode, it is possible to validate the XML 
document against a DTD. 

The acronym PXP means Polymorphic XML Parser. This name reflects the ability to 
create XML trees with polymorphic type parameters. 

==============================================================================
Download
==============================================================================

You can download PXP as a gzip'ed tarball [1]. The parser needs the Ocamlnet 
[2] package (0.9.3). Note that PXP requires O'Caml 3.09 or newer. 

Information about the latest development version is available here [3]. 

==============================================================================
PXP Reference
==============================================================================

The manual is included in the distribution as a bunch of HTML files. An online 
version can be found here [4]. 

==============================================================================
Author, Credits, Copying
==============================================================================

PXP was written by Gerd Stolpmann [5]; it contains contributions by Claudio 
Sacerdoti Coen. You may copy it as you like, and you may even use it for 
commercial purposes, as long as the license conditions are respected; see the 
file LICENSE that comes with the distribution. The license allows almost 
everything. 

Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.

==============================================================================
Description
==============================================================================

PXP is a validating XML parser for O'Caml [6]. It strictly complies with the 
XML-1.0 [7] standard. 

The parser is simple to call: usually a single statement (one function call) 
is sufficient to parse an XML document and to represent it as an object tree.

Once the document is parsed, it can be accessed using a class interface. The 
interface allows arbitrary access including transformations. One of the 
features of the document representation is its polymorphic nature; it is simple 
to add custom methods to the document classes. Furthermore, the parser can be 
configured such that different XML elements are represented by objects created 
from different classes. This is a very powerful feature, because it simplifies 
the structure of programs processing XML documents. 
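
The idea of representing different element types by objects of different 
classes can be illustrated with a small sketch. It uses only the O'Caml 
standard library and is not PXP's actual API: the class names, the table 
"spec", and the function create_element are all hypothetical and only stand in 
for whatever the parser does when it sees a start tag.

```ocaml
(* Hypothetical sketch, not PXP's API: choose a class per element type. *)

(* Common interface of all node objects in this sketch. *)
class virtual node (name : string) = object
  method name = name
  method virtual describe : string
end

(* Default representation for elements without a special class. *)
class generic_node name = object
  inherit node name
  method describe = "generic element <" ^ name ^ ">"
end

(* Custom class adding behaviour for one particular element type. *)
class title_node name = object
  inherit node name
  method describe = "document title <" ^ name ^ ">"
end

(* The "specification": element type -> constructor of the class to use. *)
let spec : (string, string -> node) Hashtbl.t = Hashtbl.create 8

let () =
  Hashtbl.replace spec "title" (fun n -> (new title_node n :> node))

(* Called by the (imaginary) parser whenever it sees a start tag. *)
let create_element (name : string) : node =
  match Hashtbl.find_opt spec name with
  | Some make -> make name
  | None -> (new generic_node name :> node)

let () =
  List.iter
    (fun n -> print_endline (create_element n)#describe)
    [ "title"; "para" ]
```

Code that processes the tree can then simply call the method and let dynamic 
dispatch select the behaviour, instead of branching on the element name at 
every use site.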

Note that the class interface does not comply with the DOM standard. It was 
not a development goal to implement a standard API (industrial developers can 
do this much better than I); however, the API is powerful enough to be 
considered equivalent to DOM. More importantly, the interface is compatible 
with the XML information model required by many XML-related standards. 

There is now also an event-oriented interface comparable to SAX. PXP also 
supports the popular pull parsing model. 
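
The difference between the event-oriented (push) style and the pull style can 
be sketched with the standard library alone. The event type and the functions 
below are hypothetical simplifications, not PXP's own definitions, and a fixed 
list of events stands in for a real parser.

```ocaml
(* Hypothetical event type for this sketch; PXP's own event type is richer. *)
type event =
  | Start_tag of string
  | End_tag of string
  | Char_data of string

(* A canned event sequence standing in for a real parser:
   it corresponds to <a>hi<b></b></a>. *)
let events =
  [ Start_tag "a"; Char_data "hi"; Start_tag "b"; End_tag "b"; End_tag "a" ]

(* Push style (SAX-like): the parser drives and calls the user back. *)
let push_parse (callback : event -> unit) : unit =
  List.iter callback events

(* Pull style: the user drives and asks for the next event.
   [None] signals the end of the stream. *)
let make_pull_parser () : unit -> event option =
  let rest = ref events in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl -> rest := tl; Some e

(* Count start tags with the push interface: the state lives in a
   reference that the callback updates. *)
let count_push () =
  let n = ref 0 in
  push_parse (function Start_tag _ -> incr n | _ -> ());
  !n

(* Count start tags with the pull interface: an ordinary recursive loop. *)
let count_pull () =
  let next = make_pull_parser () in
  let rec loop n =
    match next () with
    | None -> n
    | Some (Start_tag _) -> loop (n + 1)
    | Some _ -> loop n
  in
  loop 0

let () =
  Printf.printf "push: %d start tags, pull: %d start tags\n"
    (count_push ()) (count_pull ())
```

With the pull style the consumer keeps its state in ordinary local variables 
and control flow, which is often easier to combine with recursive-descent 
processing of the document.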

------------------------------------------------------------------------------
Detailed feature list
------------------------------------------------------------------------------

-  The XML instance is validated against the DTD; any violation of a validation 
   constraint leads to the rejection of the instance. The validator has been 
   carefully implemented, and conforms strictly to the standard. If needed, it 
   is also possible to run the parser in a well-formedness mode.
   
-  If possible, the validator applies a deterministic finite automaton to 
   validate the content models. In that case, validation is performed in 
   linear time. However, if the content models are not deterministic, the 
   parser uses a backtracking algorithm, which can be much slower. It is also 
   possible to reject non-deterministic content models.
   
-  In particular, the validator also checks the more intricate rules, such as 
   whether parentheses are properly nested with respect to entities, and 
   whether the standalone declaration is satisfied. On demand, it also checks 
   whether IDREF attributes refer only to existing nodes.
   
-  Entity references are automatically resolved while the XML text is being 
   scanned. It is not possible to recognize in the object tree where a 
   referenced entity begins or ends; the object tree only represents the 
   logical structure.
   
-  External entities are loaded using a configurable resolver infrastructure. 
   It is possible to connect the parser with an arbitrary XML source.
   
-  The parser can read XML text encoded in a variety of character sets. 
   Independent of this, it is possible to choose the encoding of the internal 
   representation of the tree nodes; the parser automatically converts the 
   input text to this encoding. Currently, the parser supports UTF-8 and 
   ISO-8859-1 as internal encodings.
   
-  The interface of the parser has been designed to integrate well with the 
   O'Caml language. The first goal was simplicity of usage, which is achieved 
   by many convenience methods and functions, and by allowing the user to 
   select which parts of the XML text are actually represented in the tree. 
   For example, it is possible to store processing instructions as tree nodes, 
   but the parser can also be configured such that these instructions are put 
   into hashtables. The information model is compatible with the requirements 
   of XML-related standards such as XPath.
   
-  In particular, the node tree can optionally contain or leave out processing 
   instructions and comments. It is also possible to generate a "super root" 
   object which is the parent of the root element. The attributes of elements 
   are normally not stored as nodes, but it is possible to get them wrapped 
   into nodes.
   
-  The powerful type system of O'Caml allows the parser to support 
   polymorphism based on the element types, i.e. the parser can be configured 
   to select different classes to represent different element types. Note that 
   no generator is needed for this feature. 
   
-  There is also an interface for DTDs; you can parse and access sequences of 
   declarations. The declarations are fully represented as recursive O'Caml 
   values. 
   
-  Since PXP 1.1, the parser supports namespaces. This has been implemented 
   using a technique called "prefix normalization", i.e. while parsing, the 
   namespace prefixes are changed (in a configurable way) such that they become 
   unique in the whole document (or document domain). This is again a solution 
   that is different from other parsers, but it allows a very convenient style 
   of processing namespaces while sticking strictly to the XML standard. 
   Another advantage of this solution is that DTDs can refer to namespaces in a 
   transparent way, i.e. it is possible to validate a document against a DTD 
   that uses different namespace prefixes for the same namespaces.
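
The linear-time check with a deterministic finite automaton mentioned in the 
feature list can be sketched as follows. This is a minimal illustration using 
only the standard library; the content model (title, para+, note?) and the 
hand-written transition function are assumptions made for the example, and the 
code is not taken from PXP's validator.

```ocaml
(* Hypothetical sketch: a hand-built DFA for the content model
   (title, para+, note?). States are plain integers. *)

(* Transition function: current state and child element name give the
   next state; None means the child is not allowed at this point. *)
let step (state : int) (child : string) : int option =
  match state, child with
  | 0, "title" -> Some 1      (* title must come first *)
  | 1, "para" -> Some 2       (* at least one para *)
  | 2, "para" -> Some 2       (* further paras *)
  | 2, "note" -> Some 3       (* optional trailing note *)
  | _ -> None

(* States in which the list of children may legally end. *)
let accepting (state : int) : bool = state = 2 || state = 3

(* One transition per child, so the check is linear in the number of
   children and never backtracks. *)
let validate (children : string list) : bool =
  let rec run state = function
    | [] -> accepting state
    | c :: rest ->
      (match step state c with
       | Some s -> run s rest
       | None -> false)
  in
  run 0 children

let () =
  List.iter
    (fun children ->
       Printf.printf "[%s] valid: %b\n"
         (String.concat "; " children) (validate children))
    [ [ "title"; "para"; "para"; "note" ]; [ "title" ] ]
```

Because every (state, child) pair has at most one successor, the automaton 
never has to guess. A non-deterministic content model lacks this property, 
which is why a backtracking fallback can be much slower.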
   
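The prefix normalization described in the last item can be pictured with the 
following sketch. It is one plausible way to realize the idea with the 
standard library and is not PXP's implementation: the table "manager", the 
functions normprefix_of_uri and normalize, and the URI "urn:example:other" are 
hypothetical names chosen for the example.

```ocaml
(* Hypothetical sketch of prefix normalization; names are not PXP's API. *)

(* Global table: namespace URI -> the one normalized prefix used for it. *)
let manager : (string, string) Hashtbl.t = Hashtbl.create 8

(* Is this prefix already reserved for some URI? *)
let prefix_in_use (p : string) : bool =
  Hashtbl.fold (fun _ q found -> found || q = p) manager false

(* Return the normalized prefix for [uri], registering one if needed.
   The source prefix is kept when free; otherwise a number is appended. *)
let normprefix_of_uri ~(source_prefix : string) (uri : string) : string =
  match Hashtbl.find_opt manager uri with
  | Some p -> p
  | None ->
    let rec fresh i =
      let cand =
        if i = 0 then source_prefix else source_prefix ^ string_of_int i in
      if prefix_in_use cand then fresh (i + 1) else cand
    in
    let p = fresh 0 in
    Hashtbl.replace manager uri p;
    p

(* Rewrite a qualified name using the bindings (prefix -> URI) that are in
   scope at this point of the document. Unprefixed names and names with
   unbound prefixes are returned unchanged in this sketch. *)
let normalize (scope : (string * string) list) (qname : string) : string =
  match String.index_opt qname ':' with
  | None -> qname
  | Some i ->
    let prefix = String.sub qname 0 i in
    let local = String.sub qname (i + 1) (String.length qname - i - 1) in
    (match List.assoc_opt prefix scope with
     | None -> qname
     | Some uri -> normprefix_of_uri ~source_prefix:prefix uri ^ ":" ^ local)

let () =
  (* A DTD or program fixes "h" as the prefix for the XHTML namespace. *)
  Hashtbl.replace manager "http://www.w3.org/1999/xhtml" "h";
  (* The document uses "x" for that namespace and "h" for another one. *)
  let scope = [ "x", "http://www.w3.org/1999/xhtml";
                "h", "urn:example:other" ] in
  print_endline (normalize scope "x:body");
  print_endline (normalize scope "h:item")
```

After normalization, every prefix denotes exactly one namespace in the whole 
document, so code (or a DTD) can match on the prefixed name directly without 
looking up the bindings that are in scope.
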
------------------------------------------------------------------------------
Recent Changes
------------------------------------------------------------------------------

-  1.2.4: Fixed bug in namespace-aware parsing (thanks to ygrek and Thomas 
   Leonard)
   
-  1.2.3: Ported to OCaml-4.00 
   
-  1.2.2: Fixing the interaction of catalog and file resolution.
   Fix because of a change in Ocamlnet-3.3.1
   
-  1.2.1: Revised documentation
   Addition: Pxp_event.unwrap_document
   Addition: Pxp_dtd.Entity.lookup
   Addition: node method entity_id
   Addition: Pxp_event.close_entities
   Removed: Pxp_core_types_type, Pxp_type_anchor. Pxp_core_types now has three 
   submodules A, S, I taking over their roles
   Removed: E_pinstr_member. Instead, E_pinstr events are emitted
   Renaming, and addition: `Entry_content has been renamed to 
   `Entry_element_content. A new `Entry_content with different semantics has 
   been added, now conforming to a standard production.
   Improvement: The parser also accepts a BOM as UTF-8 sequence. Also, the 
   autodetection of the encoding for UTF-16 has been enhanced
   Fix: Pxp_marshal module also keeps namespace scope objects
   Addition: method lexbuf in lexer_obj
   
-  1.2.0test*: New ~minimization option for the [write] and [display] methods 
   (user wish).
   Improvement: better control over what is printed as DTD for document#write 
   and #display
   Fix: [Pxp_document.liquefy] now terminates when invoked only on a subtree 
   of a document
   Cleaned up the code a bit so fewer warnings are emitted in the build.
   Ported pxp-pp to O'Caml 3.10
   
-  1.1.96: Works now for O'Caml 3.09, too.
   Fix: The "root element check" is disabled in Pxp_dtd. It did not work 
   together with namespaces.
   Pxp_validate: Fix for namespace mode
   
-  1.1.95: Addition of ulex lexing.
   Fix in Pxp_reader.combine.
   Revised namespace handling: There are now namespace_scope objects keeping 
   the scoping structure of the namespaces. The namespace_info stuff has been 
   removed. The "display" methods can print XML while respecting the scoping 
   structure.
   New exceptions Namespace_not_managed, Namespace_prefix_not_managed, 
   Namespace_not_in_scope (all replacing Not_found). Methods of 
   namespace_manager may raise these exceptions.
   The event-based representation of XML is now symmetrical to the tree-based 
   representation, such that it is possible to convert one representation into 
   the other without loss. The type of events had to be changed to achieve 
   this effect.
   The new module Pxp_event contains functions for the event-based 
   representation.
   Addition of pxp-pp, the PXP preprocessor.
   This release requires Ocamlnet 0.98. You should also install ulex. There 
   are no longer precompiled wlex lexers (use ulex instead).
   
-  1.1.94.2: Again fixes for the combination of 3.07/wlex
   
-  1.1.94.1: Fixes for 3.07 concerning the pregenerated wlexers.
   New: Pxp_document.build_node_tree
   
-  1.1.94: The Pxp_reader module has been completely rewritten. This fixes some 
   problems with relative URLs.
   Pxp_yacc has been split up into four modules: Pxp_tree_parser now contains 
   the parser API returning object trees, Pxp_dtd_parser is the parser API 
   returning DTDs, Pxp_ev_parser is the event-based API, and Pxp_core_parser is 
   the core of the parser. Pxp_yacc is still available as compatibility API. As 
   part of the module redesign, Pxp_types now includes parts of its interface 
   from Pxp_core_types_type. I hope this style of programming is 
   comprehensible.
   I think PXP can now be compiled with CVS releases of O'Caml.
   It is now possible to turn warnings into errors.
   The event-based parser can now preprocess namespaces. Furthermore, there are 
   normalization filters.
   
-  1.1.93: This is a bugfix release. Sometimes files were not closed in 
   previous versions, but now they are. There were debug statements in the pull 
   parser code, I have removed them. Finally, some errors in the Makefiles have 
   been corrected.
   
-  1.1.92: The whole lexing stuff has been restructured. There is a new tool, 
   lexpp, that generates the lexers from only five files. Furthermore, many 
   more 8-bit character sets are now supported as internal encodings. In 
   previous versions of PXP, the internal representation of the XML trees was 
   restricted to either UTF-8 or ISO-8859-1. Now, a number of additional 
   encodings are supported, including the whole ISO-8859 series. 
   Bugfix: If the processing instruction <?xml...?> occurs in the middle of the 
   XML document, version 1.1.91 will immediately stop parsing, and ignore the 
   rest of the file. This is now fixed.
   
-  1.1.91: The curly braces can now even be used inside attributes, and escape 
   from normal XML parsing.
   There is a new entry point Entry_expr for event-based parsing that expects 
   either a single element, a single processing instruction, or a single 
   comment, or whitespace. This allows more fine-grained control of what is 
   parsed.
   There is now a "pull parser". In contrast to the "push parser" introduced in 
   1.1.90, the calling order of parser and parser user has been inverted, i.e. 
   the user calls the parser to get ("pull") the next event instead of letting 
   the parser call back a user function ("push"). An interesting application is 
   that O'Caml's lazy streams can be used to analyze events. An example can be 
   found in examples/pullparser.
   Pull parsing is not yet well-tested!
   
-  1.1.90: This version introduces a new event-based interface in Pxp_yacc. For 
   start tags, end tags, data strings, and several other things that are found 
   in the XML source so-called events are generated, and a user function is 
   called for every event. See the directory examples/eventparser for examples.
   Another innovation is support for curly braces as escape characters. Inside 
   elements, the left curly brace escapes from XML parsing and starts a foreign 
   parser until the matching right curly brace is found: 
   
   <element> ... { foreign syntax } ... </element>
   
   The curly braces are borrowed from the XQuery draft standard. They cannot 
   yet be used inside attribute values. Curly braces are mostly useful in 
   conjunction with event-based parsing, because it is not yet possible to 
   include the "value" of the curly brace expression into XML trees.
   It is even possible to call the XML parser from the foreign parser as a 
   subparser. However, there are not yet enough entry points for the 
   event-based parser (e.g. you cannot parse just the following processing 
   instruction; only misc* element misc* or whole documents are possible). 
   A long-standing bug has been found in the entity layer. When an external 
   entity A opens an external entity B, and B opens C, relative paths of C were 
   interpreted incorrectly.
   
-  Changed in 1.1.5:
   A packaging error in pxp-wlex has been corrected. (This is the only change.)
   
-  Changed in 1.1.4:
   This is a bigger bug fix release that addresses the following problems:
   The parser no longer shows very bad performance when large data nodes 
   without line feeds are parsed.
   Another performance problem with bigger DTDs has been solved, too. 
   Especially, the XHTML DTD can now be parsed quite quickly.
   The interface Pxp_dtd.Entity has been extended; it is now possible to access 
   more properties of entities than before.
   Pxp_marshal has been revised. It is now possible to recode the input or 
   output stream on the fly in order to change the character encoding. 
   Furthermore, the functions relocate_subtree and relocate_document allow one 
   to marshal an XML tree or a document, and to read the marshaled data 
   immediately to create a copy of the original structure.
   Some errors have been fixed in from_file. Especially, this function will no 
   longer raise the exception Malformed_URL if the current working directory 
   happens to be "/".
   Pxp_document.strip_whitespace implements xml:space now correctly. In 
   previous versions, xml:space='default' was ignored when it occurred inside 
   an element with xml:space='preserve'. Now the inner xml:space='default' 
   overrides the outer xml:space='preserve' as defined in the XML standard.
   
-  Changed in 1.1.3:
   This release fixes a single problem occurring when PXP is compiled with 
   installed netstring-0.10. (There is not any problem with netstring-0.91.) 
   
-  Changed in 1.1.2:
   Improved write method for whole documents. It can now also output a 
   reference to an external DTD.
   PXP can be compiled with O'Caml 3.04.
   
-  Changed in 1.1.1:
   Minor changes for O'Caml 3.03-alpha. The interfaces have not been modified.
   
-  Changed in 1.1:
   The parser now supports namespaces.
   Extended/updated Pxp_document interface. There is now a separate class for 
   every node type. It is now clear which node methods validate and which do 
   not validate. The node tree can now be modified more easily 
   (insert/delete). It is now possible to start in well-formedness mode and 
   validate the XML tree later (as a whole, or partially).
   New functions for tree normalization and whitespace stripping.
   The implementation of Pxp_document has been updated, too. There are now 
   many virtual classes, one class for one task. The attribute representation 
   has been improved. The overall size of the document tree has been reduced.
   The parser is better at counting lines. The option errors_with_line_numbers 
   could be removed because the parser is now fast enough that it makes sense 
   to always count lines.
   There are now string pools that can save memory in some situations.
   New module Pxp_marshal allows marshalling of XML trees over channels 
   (faster than writing the tree and reparsing it).
   For the most important entity functions there is an interface 
   Pxp_document.Entity.
   Although there are many extensions, the parser has been sped up.
   The parser has been divided up into several packages, and the directory 
   structure of the distribution has been cleaned up.
   It is possible to choose among several lexical analyzers. One of them is 
   based on Alain Frisch's wlex patch, which reduces the size of executables 
   if a UTF-8 parser is needed.
   The parser works under Cygwin.
   Of course, there are several bug fixes. Note that most of the fixed bugs 
   were introduced in the development cycle between 1.0 and 1.1; only very few 
   problems have been detected in the 1.0 release. I hope that 1.1 has similar 
   quality.
   

--------------------------

[1]   see http://download.camlcity.org/download/pxp-1.1.6.tar.gz

[2]   see /projects/ocamlnet.html

[3]   see /projects/pxp.html

[4]   see /projects/dl/pxp-1.1.6/doc/manual/html/index.html

[5]   see mailto:gerd@gerd-stolpmann.de

[6]   see http://caml.inria.fr/

[7]   see http://www.w3.org/TR/1998/REC-xml-19980210.html



