******************************************************************************
README - PXP, the XML parser for O'Caml
******************************************************************************
==============================================================================
Abstract
==============================================================================
PXP is an XML parser for O'Caml. It represents the parsed document either as a
tree or as a stream of events. In tree mode, it is possible to validate the XML
document against a DTD.
The acronym PXP means Polymorphic XML Parser. This name reflects the ability to
create XML trees with polymorphic type parameters.
==============================================================================
Download
==============================================================================
You can download PXP as a gzip'ed tarball [1]. The parser needs the Ocamlnet [2]
package (0.9.3). Note that PXP requires O'Caml 3.09 or newer.
Information about the latest development version is available here [3].
==============================================================================
PXP Reference
==============================================================================
The manual is included in the distribution as a bunch of HTML files. An online
version can be found here [4].
==============================================================================
Author, Credits, Copying
==============================================================================
PXP has been written by Gerd Stolpmann [5]; it contains contributions by
Claudio Sacerdoti Coen. You may copy it as you like, and you may even use it for
commercial purposes as long as the license conditions are respected; see the
file LICENSE coming with the distribution. It allows almost everything.
Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.
==============================================================================
Description
==============================================================================
PXP is a validating XML parser for O'Caml [6]. It strictly complies with the
XML-1.0 [7] standard.
The parser is simple to call: usually a single statement (a function call) is
sufficient to parse an XML document and to represent it as an object tree.
Once the document is parsed, it can be accessed using a class interface. The
interface allows arbitrary access including transformations. One of the
features of the document representation is its polymorphic nature; it is simple
to add custom methods to the document classes. Furthermore, the parser can be
configured such that different XML elements are represented by objects created
from different classes. This is a very powerful feature, because it simplifies
the structure of programs processing XML documents.
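The idea of representing different elements by objects of different classes
can be sketched in plain O'Caml. This is only an illustration that uses the
standard library; it is not PXP's API, and all names in it (node,
generic_node, price_node, constructors, make_node) are hypothetical.

```ocaml
(* Hypothetical names throughout; this is not PXP's API. *)

(* Common interface that every node object provides. *)
class virtual node (name : string) (text : string) =
  object
    method name = name
    method text = text
    method virtual describe : string
  end

(* Default representation for elements without a special class. *)
class generic_node name text =
  object
    inherit node name text
    method describe =
      Printf.sprintf "<%s> with %d characters" name (String.length text)
  end

(* Custom class that is selected only for <price> elements. *)
class price_node name text =
  object
    inherit node name text
    method describe = Printf.sprintf "price of %.2f" (float_of_string text)
  end

(* Table mapping element names to the constructor to use for them. *)
let constructors : (string, string -> string -> node) Hashtbl.t =
  Hashtbl.create 8

let () =
  Hashtbl.replace constructors "price"
    (fun n t -> (new price_node n t :> node))

(* Create the object for an element, falling back to the generic class. *)
let make_node (name : string) (text : string) : node =
  match Hashtbl.find_opt constructors name with
  | Some make -> make name text
  | None -> (new generic_node name text :> node)
```

A program processing the tree can then call describe on any node without
testing the element name itself; the dispatch is done by the object.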
Note that the class interface does not comply with the DOM standard. It was not
a development goal to realize a standard API (industrial developers can do this
much better than I); however, the API is powerful enough to be considered
equivalent to DOM. More importantly, the interface is compatible with the XML
information model required by many XML-related standards.
There is now also an event-oriented interface comparable to SAX. PXP also
supports the popular pull parsing model.
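The difference between the two event-oriented styles can be sketched as
follows. This is a stand-alone illustration using only the standard library;
the event type and the functions push_parse and make_pull_parser are
hypothetical and much simpler than what PXP provides, and a fixed list of
events stands in for a real parser.

```ocaml
(* Hypothetical event type; PXP's real event type is richer. *)
type event =
  | Start_tag of string
  | End_tag of string
  | Data of string

(* A fixed event sequence standing in for the output of a real parser. *)
let sample =
  [ Start_tag "a"; Data "hello"; Start_tag "b"; End_tag "b"; End_tag "a" ]

(* Push style: the producer drives and calls the user's function per event. *)
let push_parse (handler : event -> unit) : unit =
  List.iter handler sample

(* Pull style: the user drives and asks for the next event; None marks the
   end of the input. *)
let make_pull_parser () : unit -> event option =
  let rest = ref sample in
  fun () ->
    match !rest with
    | [] -> None
    | ev :: tl -> rest := tl; Some ev

(* Count start tags with the push interface: state lives in a reference. *)
let count_push () : int =
  let n = ref 0 in
  push_parse (function Start_tag _ -> incr n | _ -> ());
  !n

(* Count start tags with the pull interface: state lives in the loop. *)
let count_pull () : int =
  let next = make_pull_parser () in
  let rec loop n =
    match next () with
    | None -> n
    | Some (Start_tag _) -> loop (n + 1)
    | Some _ -> loop n
  in
  loop 0
```

In the pull version the consumer keeps its state in ordinary function
arguments, which is often the reason this model is preferred.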
------------------------------------------------------------------------------
Detailed feature list
------------------------------------------------------------------------------
- The XML instance is validated against the DTD; any violation of a validation
constraint leads to the rejection of the instance. The validator has been
carefully implemented, and conforms strictly to the standard. If needed, it
is also possible to run the parser in a well-formedness mode.
- If possible, the validator applies a deterministic finite automaton to
validate the content models. In this case, validation is performed in linear
time. However, if the content models are not deterministic, the parser uses a
backtracking algorithm which can be much slower. It is also possible to
reject non-deterministic content models.
- In particular, the validator also checks the complicated rules on whether
parentheses are properly nested with respect to entities, and whether the
standalone declaration is satisfied. On demand, it is checked whether the
IDREF attributes only refer to existing nodes.
- Entity references are automatically resolved while the XML text is being
scanned. It is not possible to recognize in the object tree where a
referenced entity begins or ends; the object tree only represents the
logical structure.
- External entities are loaded using a configurable resolver infrastructure.
It is possible to connect the parser with an arbitrary XML source.
- The parser can read XML text encoded in a variety of character sets.
Independent of this, it is possible to choose the encoding of the internal
representation of the tree nodes; the parser automatically converts the
input text to this encoding. Currently, the parser supports UTF-8 and
ISO-8859-1 as internal encodings.
- The interface of the parser has been designed to integrate well with the
O'Caml language. The first goal was simplicity of usage
which is achieved by many convenience methods and functions, and by allowing
the user to select which parts of the XML text are actually represented in
the tree. For example, it is possible to store processing instructions as
tree nodes, but the parser can also be configured such that these
instructions are put into hashtables. The information model is compatible
with the requirements of XML-related standards such as XPath.
- In particular, the node tree can optionally contain or leave out processing
instructions and comments. It is also possible to generate a "super root"
object which is the parent of the root element. The attributes of elements
are normally not stored as nodes, but it is possible to get them wrapped
into nodes.
- The powerful type system of O'Caml allows the parser to support
polymorphism based on the element types, i.e. the parser can be configured
to select different classes to represent different element types. Note that
no generator is needed for this feature.
- There is also an interface for DTDs; you can parse and access sequences of
declarations. The declarations are fully represented as recursive O'Caml
values.
- Since PXP 1.1, the parser supports namespaces. This has been implemented
using a technique called "prefix normalization", i.e. while parsing, the
namespace prefixes are changed (in a configurable way) such that they become
unique in the whole document (or document domain). This is again a solution
that is different from other parsers, but it allows a very convenient style
of processing namespaces while sticking strictly to the XML standard.
Another advantage of this solution is that DTDs can refer to namespaces in a
transparent way, i.e. it is possible to validate a document against a DTD
that uses different namespace prefixes for the same namespaces.
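The effect of prefix normalization, as described in the last item above, can
be sketched in a few lines. This is a minimal illustration under simplifying
assumptions, not PXP's implementation: the helpers split_name and normalize
are hypothetical, both tables are plain association lists, and the default
namespace (names without a prefix) is left untouched.

```ocaml
(* Hypothetical helper names; this is not PXP's API. *)

(* Split "p:local" into (Some "p", "local"); a name without a colon has no
   prefix. *)
let split_name (name : string) : string option * string =
  match String.index_opt name ':' with
  | None -> (None, name)
  | Some i ->
      (Some (String.sub name 0 i),
       String.sub name (i + 1) (String.length name - i - 1))

(* Rewrite a qualified name so that it uses the normalized prefix of its
   namespace.
   [scope] maps the prefixes as written in the document to namespace URIs.
   [normal] maps namespace URIs to the prefixes chosen by the application. *)
let normalize ~(scope : (string * string) list)
    ~(normal : (string * string) list) (name : string) : string =
  match split_name name with
  | (None, _) -> name
  | (Some prefix, local) ->
      (* List.assoc raises Not_found for an undeclared prefix... *)
      let uri = List.assoc prefix scope in
      (* ...or for a namespace the application has not registered. *)
      let norm = List.assoc uri normal in
      norm ^ ":" ^ local

(* Two documents that bind different prefixes to the same namespace. *)
let xhtml = "http://www.w3.org/1999/xhtml"
let normal = [ (xhtml, "html") ]
let from_doc1 = normalize ~scope:[ ("h", xhtml) ] ~normal "h:body"
let from_doc2 = normalize ~scope:[ ("x", xhtml) ] ~normal "x:body"
```

Both from_doc1 and from_doc2 evaluate to "html:body", so code (or a DTD) that
is written against the normalized prefix works for either document.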
------------------------------------------------------------------------------
Recent Changes
------------------------------------------------------------------------------
- 1.2.9: Build with OCaml-4.04.0
- 1.2.8: Build against ocamlnet-4.1. Support -safe-string
- 1.2.7: tweaking support for oasis.
- 1.2.6: Adding _oasis file.
- 1.2.5: Build against ocamlnet-4.
- 1.2.4: Fixed bug in namespace-aware parsing (thanks to ygrek and Thomas
Leonard)
- 1.2.3: Ported to OCaml-4.00
- 1.2.2: Fixing the interaction of catalog and file resolution.
Fix because of a change in Ocamlnet-3.3.1
- 1.2.1: Revised documentation
Addition: Pxp_event.unwrap_document
Addition: Pxp_dtd.Entity.lookup
Addition: node method entity_id
Addition: Pxp_event.close_entities
Removed: Pxp_core_types_type, Pxp_type_anchor. Pxp_core_types now has three
submodules A, S, I taking over their roles
Removed: E_pinstr_member. Instead, E_pinstr events are emitted
Renaming, and addition: `Entry_content has been renamed to
`Entry_element_content. A new `Entry_content with different semantics has
been added, now conforming to a standard production.
Improvement: The parser also accepts a BOM as UTF-8 sequence. Also, the
autodetection of the encoding for UTF-16 has been enhanced
Fix: Pxp_marshal module also keeps namespace scope objects
Addition: method lexbuf in lexer_obj
- 1.2.0test*: New ~minimization option for the [write] and [display] methods
(user wish).
Improvement: better control over what is printed as DTD for document#write and
#display
Fix: [Pxp_document.liquefy] terminates now when invoked only on a subtree
of a document
Cleaned up the code a bit so fewer warnings are emitted in the build.
Ported pxp-pp to O'Caml 3.10
- 1.1.96: Works now for O'Caml 3.09, too.
Fix: The "root element check" is disabled in Pxp_dtd. It did not work
together with namespaces.
Pxp_validate: Fix for namespace mode
- 1.1.95: Addition of ulex lexing.
Fix in Pxp_reader.combine.
Revised namespace handling: There are now namespace_scope objects keeping
the scoping structure of the namespaces. The namespace_info stuff has been
removed. The "display" methods can print XML while respecting the scoping
structure.
New exceptions Namespace_not_managed, Namespace_prefix_not_managed,
Namespace_not_in_scope (all replacing Not_found). Methods of
namespace_manager may raise these exceptions.
The event-based representation of XML is now symmetrical to the tree-based
representation, such that it is possible to convert one representation into
the other without loss. The type of events had to be changed to achieve
this effect.
The new module Pxp_event contains functions for the event-based
representation.
Addition of pxp-pp, the PXP preprocessor.
This release requires Ocamlnet 0.98. You should also install ulex. There
are no longer precompiled wlex lexers (use ulex instead).
- 1.1.94.2: Again fixes for the combination of 3.07/wlex
- 1.1.94.1: Fixes for 3.07 concerning the pregenerated wlexers.
New: Pxp_document.build_node_tree
- 1.1.94: The Pxp_reader module has been completely rewritten. This fixes some
problems with relative URLs.
Pxp_yacc has been split up into four modules: Pxp_tree_parser now contains the
parser API returning object trees, Pxp_dtd_parser is the parser API returning
DTDs, Pxp_ev_parser is the event-based API, and Pxp_core_parser is the core of
the parser. Pxp_yacc is still available as compatibility API. As part of the
module redesign, Pxp_types now includes parts of its interface from
Pxp_core_types_type. I hope this style of programming is comprehensible.
I think PXP can now be compiled with CVS releases of O'Caml.
It is now possible to turn warnings into errors.
The event-based parser can now preprocess namespaces. Furthermore, there are
normalization filters.
- 1.1.93: This is a bugfix release. Sometimes files were not closed in
previous versions, but now they are. There were debug statements in the pull
parser code; I have removed them. Finally, some errors in the Makefiles have
been corrected.
- 1.1.92: The whole lexing stuff has been restructured. There is a new tool,
lexpp, that generates the lexers from only five files. Furthermore, many
more 8-bit character sets are now supported as internal encodings. In
previous versions of PXP, the internal representation of the XML trees was
restricted to either UTF-8 or ISO-8859-1. Now, a number of additional
encodings are supported, including the whole ISO-8859 series.
Bugfix: If the processing instruction <?xml...?> occurs in the middle of the
XML document, version 1.1.91 will immediately stop parsing, and ignore the
rest of the file. This is now fixed.
- 1.1.91: The curly braces can now even be used inside attributes, and escape
from normal XML parsing.
There is a new entry point Entry_expr for event-based parsing that expects
either a single element, a single processing instruction, or a single
comment, or whitespace. This allows more fine-grained control of what is
parsed.
There is now a "pull parser". In contrast to the "push parser" introduced in
1.1.90, the calling order of parser and parser user has been inverted, i.e.
the user calls the parser to get ("pull") the next event instead of letting
the parser call back a user function ("push"). An interesting application is
that O'Caml's lazy streams can be used to analyze events. An example can be
found in examples/pullparser.
Pull parsing is not yet well-tested!
- 1.1.90: This version introduces a new event-based interface in Pxp_yacc. For
start tags, end tags, data strings, and several other things that are found
in the XML source so-called events are generated, and a user function is
called for every event. See the directory examples/eventparser for examples.
Another innovation is support for curly braces as escape characters. Inside
elements, the left curly brace escapes from XML parsing and starts a foreign
parser until the matching right curly brace is found:
<element> ... { foreign syntax } ... </element>
The curly braces are borrowed from the XQuery draft standard. They cannot
yet be used inside attribute values. Curly braces are mostly useful in
conjunction with event-based parsing, because it is not yet possible to
include the "value" of the curly brace expression into XML trees.
It is even possible to call the XML parser from the foreign parser as a
subparser. However, there are not yet enough entry points for the event-based
parser (e.g. you cannot parse just the following processing instruction,
only misc* element misc* or whole documents are possible).
A long-standing bug has been found in the entity layer. When an external
entity A opens an external entity B, and B opens C, relative paths of C were
interpreted incorrectly.
- Changed in 1.1.5:
A packaging error in pxp-wlex has been corrected. (This is the only change.)
- Changed in 1.1.4:
This is a bigger bug fix release that addresses the following problems:
The parser no longer shows very bad performance when large data nodes
without line feeds are parsed.
Another performance problem with bigger DTDs has been solved, too.
Especially, the XHTML DTD can now be parsed quite quickly.
The interface Pxp_dtd.Entity has been extended; it is now possible to access
more properties of entities than before.
Pxp_marshal has been revised. It is now possible to recode the input or
output stream on the fly in order to change the character encoding.
Furthermore, the functions relocate_subtree and relocate_document allow one
to marshal an XML tree or a document, and to read the marshaled data
immediately to create a copy of the original structure.
Some errors have been fixed in from_file. Especially, this function will no
longer raise the exception Malformed_URL if the current working directory
happens to be "/".
Pxp_document.strip_whitespace implements xml:space now correctly. In
previous versions, xml:space='default' was ignored when it occurred inside
an element with xml:space='preserve'. Now the inner xml:space='default'
overrides the outer xml:space='preserve' as defined in the XML standard.
- Changed in 1.1.3:
This release fixes a single problem occurring when PXP is compiled with
installed netstring-0.10. (There is not any problem with netstring-0.91.)
- Changed in 1.1.2:
Improved write method for whole documents. It can now also output a
reference to an external DTD.
PXP can be compiled with O'Caml 3.04.
- Changed in 1.1.1:
Minor changes for O'Caml 3.03-alpha. The interfaces have not been modified.
- Changed in 1.1:
The parser now supports namespaces.
Extended/updated Pxp_document interface. There is now a separate class for
every node type. It is now clear which node methods validate and which do
not validate. The node tree can now be modified more easily (insert/delete).
It is now possible to start in well-formedness mode and validate the XML
tree later (as a whole, or partially).
New functions for tree normalization, and whitespace stripping.
The implementation of Pxp_document has been updated, too. There are now
many virtual classes, one class for one task. The attribute representation
has been improved. The overall size of the document tree has been reduced.
The parser is better in counting lines. The option errors_with_line_numbers
could be removed because the parser is now fast enough that it does make
sense to always count lines.
There are now string pools that can save memory in some situations.
New module Pxp_marshal allows marshalling of XML trees over channels
(faster than writing the tree and reparsing it).
For the most important entity functions there is an interface
Pxp_document.Entity.
Although there are many extensions, the parser has been sped up.
The parser has been divided up into several packages, and the directory
structure of the distribution has been cleaned up.
It is possible to choose among several lexical analyzers. One of them is based
on Alain Frisch's wlex patch, which reduces the size of executables if a
UTF-8 parser is needed.
The parser works under Cygwin.
Of course, there are several bug fixes. Note that most bugs were introduced in
the development cycle between 1.0 and 1.1; only very few problems have been
detected in the 1.0 release. I hope that 1.1 has similar quality.
--------------------------
[1] see http://download.camlcity.org/download/pxp-1.1.6.tar.gz
[2] see /projects/ocamlnet.html
[3] see /projects/pxp.html
[4] see /projects/dl/pxp-1.1.6/doc/manual/html/index.html
[5] see mailto:gerd@gerd-stolpmann.de
[6] see http://caml.inria.fr/
[7] see http://www.w3.org/TR/1998/REC-xml-19980210.html