******************************************************************************
README - PXP, the XML parser for O'Caml
******************************************************************************


==============================================================================
Abstract
==============================================================================

PXP is an XML parser for O'Caml. It represents the parsed document either as
a tree or as a stream of events. In tree mode, it is possible to validate the
XML document against a DTD.

The acronym PXP means Polymorphic XML Parser. This name reflects the ability
to create XML trees with polymorphic type parameters.

==============================================================================
Download
==============================================================================

You can download PXP as a gzip'ed tarball [1]. The parser needs the Ocamlnet
[2] package (0.9.3). Note that PXP requires O'Caml 3.09 or newer.

Information about the latest development version is available here [3].

==============================================================================
PXP Reference
==============================================================================

The manual is included in the distribution as a bunch of HTML files. An
online version can be found here [4].

==============================================================================
Author, Credits, Copying
==============================================================================

PXP has been written by Gerd Stolpmann [5]; it contains contributions by
Claudio Sacerdoti Coen. You may copy it as you like, and you may use it even
for commercial purposes as long as the license conditions are respected; see
the file LICENSE coming with the distribution. It allows almost everything.

Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.

==============================================================================
Description
==============================================================================

PXP is a validating XML parser for O'Caml [6]. It strictly complies with the
XML-1.0 [7] standard.

The parser is simple to call; usually a single statement (function call) is
sufficient to parse an XML document and to represent it as an object tree.

Once the document is parsed, it can be accessed using a class interface. The
interface allows arbitrary access including transformations. One of the
features of the document representation is its polymorphic nature; it is
simple to add custom methods to the document classes. Furthermore, the parser
can be configured such that different XML elements are represented by objects
created from different classes. This is a very powerful feature, because it
simplifies the structure of programs processing XML documents.

Note that the class interface does not comply with the DOM standard. It was
not a development goal to realize a standard API (industrial developers can
do this much better than I); however, the API is powerful enough to be
considered equivalent to DOM. More importantly, the interface is compatible
with the XML information model required by many XML-related standards.

There is now also an event-oriented interface comparable to SAX. PXP also
supports the popular pull parsing model.

------------------------------------------------------------------------------
Detailed feature list
------------------------------------------------------------------------------

  - The XML instance is validated against the DTD; any violation of a
    validation constraint leads to the rejection of the instance. The
    validator has been carefully implemented, and conforms strictly to the
    standard. If needed, it is also possible to run the parser in a
    well-formedness mode.

  - If possible, the validator applies a deterministic finite automaton to
    validate the content models.
    In this case, validation is always performed in linear time. However,
    if the content models are not deterministic, the parser uses a
    backtracking algorithm which can be much slower. It is also possible
    to reject non-deterministic content models.

  - In particular, the validator also checks the complicated rules governing
    whether parentheses are properly nested with respect to entities, and
    whether the standalone declaration is satisfied. On demand, it is
    checked whether the IDREF attributes only refer to existing nodes.

  - Entity references are automatically resolved while the XML text is
    being scanned. It is not possible to recognize in the object tree where
    a referenced entity begins or ends; the object tree only represents the
    logical structure.

  - External entities are loaded using a configurable resolver
    infrastructure. It is possible to connect the parser with an arbitrary
    XML source.

  - The parser can read XML text encoded in a variety of character sets.
    Independent of this, it is possible to choose the encoding of the
    internal representation of the tree nodes; the parser automatically
    converts the input text to this encoding. The parser supports UTF-8 and
    ISO-8859-1 as internal encodings; since 1.1.92, a number of additional
    8 bit encodings, including the whole ISO-8859 series, are supported as
    well.

  - The interface of the parser has been designed to integrate as well as
    possible into the O'Caml language. The first goal was simplicity of
    use, which is achieved by many convenience methods and functions, and
    by allowing the user to select which parts of the XML text are actually
    represented in the tree. For example, it is possible to store
    processing instructions as tree nodes, but the parser can also be
    configured such that these instructions are put into hashtables. The
    information model is compatible with the requirements of XML-related
    standards such as XPath.

  - In particular, the node tree can optionally contain or leave out
    processing instructions and comments.
    It is also possible to generate a "super root" object which is the
    parent of the root element. The attributes of elements are normally not
    stored as nodes, but it is possible to get them wrapped into nodes.

  - The powerful type system of O'Caml makes it possible for the parser to
    support polymorphism based on the element types, i.e. the parser can be
    configured to select different classes to represent different element
    types. Note that no generator is needed for this feature.

  - There is also an interface for DTDs; you can parse and access sequences
    of declarations. The declarations are fully represented as recursive
    O'Caml values.

  - Since PXP 1.1, the parser supports namespaces. This has been
    implemented using a technique called "prefix normalization", i.e. while
    parsing, the namespace prefixes are changed (in a configurable way) such
    that they become unique in the whole document (or document domain).
    This is again a solution that is different from other parsers, but it
    allows a very convenient style of processing namespaces while sticking
    strictly to the XML standard. Another advantage of this solution is
    that DTDs can refer to namespaces in a transparent way, i.e. it is
    possible to validate a document against a DTD that uses different
    namespace prefixes for the same namespaces.

------------------------------------------------------------------------------
Recent Changes
------------------------------------------------------------------------------

  - 1.2.2: Fixing the interaction of catalog and file resolution.
    Fix because of a change in Ocamlnet-3.3.1

  - 1.2.1: Revised documentation
    Addition: Pxp_event.unwrap_document
    Addition: Pxp_dtd.Entity.lookup
    Addition: node method entity_id
    Addition: Pxp_event.close_entities
    Removed: Pxp_core_types_type, Pxp_type_anchor. Pxp_core_types now has
      three submodules A, S, I taking over their roles
    Removed: E_pinstr_member.
      Instead, E_pinstr events are emitted
    Renaming, and addition: `Entry_content has been renamed to
      `Entry_element_content. A new `Entry_content with different semantics
      has been added, now conforming to a standard production.
    Improvement: The parser also accepts a BOM as UTF-8 sequence. Also, the
      autodetection of the encoding for UTF-16 has been enhanced
    Fix: Pxp_marshal module also keeps namespace scope objects
    Addition: method lexbuf in lexer_obj

  - 1.2.0test*: New ~minimization option for the [write] and [display]
      methods (user wish).
    Improvement: better control over what is printed as DTD for
      document#write and #display
    Fix: [Pxp_document.liquefy] now terminates when invoked only on a
      subtree of a document
    Cleaned up the code a bit so fewer warnings are emitted in the build.
    Ported pxp-pp to O'Caml 3.10

  - 1.1.96: Works now for O'Caml 3.09, too.
    Fix: The "root element check" is disabled in Pxp_dtd. It did not work
      together with namespaces.
    Pxp_validate: Fix for namespace mode

  - 1.1.95: Addition of ulex lexing.

    Fix in Pxp_reader.combine.

    Revised namespace handling: There are now namespace_scope objects
    keeping the scoping structure of the namespaces. The namespace_info
    stuff has been removed. The "display" methods can print XML while
    respecting the scoping structure.

    New exceptions Namespace_not_managed, Namespace_prefix_not_managed,
    Namespace_not_in_scope (all replacing Not_found). Methods of
    namespace_manager may raise these exceptions.

    The event-based representation of XML is now symmetrical to the
    tree-based representation, such that it is possible to convert one
    representation into the other without loss. The type of events had to
    be changed to achieve this effect.

    The new module Pxp_event contains functions for the event-based
    representation.

    Addition of pxp-pp, the PXP preprocessor.

    This release requires Ocamlnet 0.98. You should also install ulex.
    There are no longer precompiled wlex lexers (use ulex instead).

  - 1.1.94.2: Again fixes for the combination of 3.07/wlex

  - 1.1.94.1: Fixes for 3.07 concerning the pregenerated wlexers.

    New: Pxp_document.build_node_tree

  - 1.1.94: The Pxp_reader module has been completely rewritten. This fixes
    some problems with relative URLs.

    Pxp_yacc has been split up into four modules: Pxp_tree_parser now
    contains the parser API returning object trees, Pxp_dtd_parser is the
    parser API returning DTDs, Pxp_ev_parser is the event-based API, and
    Pxp_core_parser is the core of the parser. Pxp_yacc is still available
    as compatibility API. As part of the module redesign, Pxp_types now
    includes parts of its interface from Pxp_core_types_type. I hope this
    style of programming is comprehensible.

    I think PXP can now be compiled with CVS releases of O'Caml.

    It is now possible to turn warnings into errors.

    The event-based parser can now preprocess namespaces. Furthermore,
    there are normalization filters.

  - 1.1.93: This is a bugfix release. Sometimes files were not closed in
    previous versions, but now they are. There were debug statements in the
    pull parser code; I have removed them. Finally, some errors in the
    Makefiles have been corrected.

  - 1.1.92: The whole lexing stuff has been restructured. There is a new
    tool, lexpp, that generates the lexers from only five files.
    Furthermore, many more 8 bit character sets are now supported as
    internal encodings. In previous versions of PXP, the internal
    representation of the XML trees was restricted to either UTF-8 or
    ISO-8859-1. Now, a number of additional encodings are supported,
    including the whole ISO-8859 series.

    Bugfix: If the processing instruction <?xml...?> occurs in the middle
    of the XML document, version 1.1.91 immediately stops parsing and
    ignores the rest of the file. This is now fixed.

  - 1.1.91: The curly braces can now even be used inside attributes, and
    escape from normal XML parsing.
    There is a new entry point Entry_expr for event-based parsing that
    expects a single element, a single processing instruction, a single
    comment, or whitespace. This allows more fine-grained control of what
    is parsed.

    There is now a "pull parser". In contrast to the "push parser"
    introduced in 1.1.90, the calling order of parser and parser user has
    been inverted, i.e. the user calls the parser to get ("pull") the next
    event instead of letting the parser call back a user function ("push").
    An interesting application is that O'Caml's lazy streams can be used to
    analyze events. An example can be found in examples/pullparser.

    Pull parsing is not yet well-tested!

  - 1.1.90: This version introduces a new event-based interface in
    Pxp_yacc. For start tags, end tags, data strings, and several other
    things that are found in the XML source, so-called events are
    generated, and a user function is called for every event. See the
    directory examples/eventparser for examples.

    Another innovation is support for curly braces as escape characters.
    Inside elements, the left curly brace escapes from XML parsing and
    starts a foreign parser until the matching right curly brace is found:

    <element> ... { foreign syntax } ... </element>

    The curly braces are borrowed from the XQuery draft standard. They
    cannot yet be used inside attribute values. Curly braces are mostly
    useful in conjunction with event-based parsing, because it is not yet
    possible to include the "value" of the curly brace expression into XML
    trees.

    It is even possible to call the XML parser from the foreign parser as
    subparser. However, there are not yet enough entry points for the
    event-based parser (e.g. you cannot parse just the following processing
    instruction, only misc* element misc* or whole documents are possible).

    A long-standing bug has been found in the entity layer. When an
    external entity A opens an external entity B, and B opens C, relative
    paths of C were interpreted wrongly.

  - Changed in 1.1.5: A packaging error in pxp-wlex has been corrected.
    (This is the only change.)

  - Changed in 1.1.4: This is a bigger bug fix release that addresses the
    following problems:

    The parser no longer shows very bad performance when large data nodes
    without line feeds are parsed.

    Another performance problem with bigger DTDs has been solved, too. In
    particular, the XHTML DTD can now be parsed quite quickly.

    The interface Pxp_dtd.Entity has been extended; it is now possible to
    access more properties of entities than before.

    Pxp_marshal has been revised. It is now possible to recode the input or
    output stream on the fly in order to change the character encoding.
    Furthermore, the functions relocate_subtree and relocate_document allow
    one to marshal an XML tree or a document, and to read the marshaled
    data immediately to create a copy of the original structure.

    Some errors have been fixed in from_file. In particular, this function
    will no longer raise the exception Malformed_URL if the current working
    directory happens to be "/".

    Pxp_document.strip_whitespace now implements xml:space correctly. In
    previous versions, xml:space='default' was ignored when it occurred
    inside an element with xml:space='preserve'. Now the inner
    xml:space='default' overrides the outer xml:space='preserve' as defined
    in the XML standard.

  - Changed in 1.1.3: This release fixes a single problem occurring when
    PXP is compiled with installed netstring-0.10. (There is not any
    problem with netstring-0.91.)

  - Changed in 1.1.2: Improved write method for whole documents. It can now
    also output a reference to an external DTD.

    PXP can be compiled with O'Caml 3.04.

  - Changed in 1.1.1: Minor changes for O'Caml 3.03-alpha. The interfaces
    have not been modified.

  - Changed in 1.1: The parser now supports namespaces.

    Extended/updated Pxp_document interface. There is now a separate class
    for every node type. It is now clear which node methods validate and
    which do not validate.
    The node tree can now be modified more easily (insert/delete). It is
    now possible to start in well-formedness mode and validate the XML tree
    later (as a whole, or partially). New functions for tree normalization,
    and whitespace stripping.

    The implementation of Pxp_document has been updated, too. There are now
    many virtual classes, one class for one task. The attribute
    representation has been improved. The overall size of the document tree
    has been reduced.

    The parser is better at counting lines. The option
    errors_with_line_numbers could be removed because the parser is now
    fast enough that it does make sense to always count lines.

    There are now string pools that can save memory in some situations.

    New module Pxp_marshal allows marshalling of XML trees over channels
    (faster than writing the tree and reparsing it).

    For the most important entity functions there is an interface
    Pxp_document.Entity.

    Although there are many extensions, the parser has been sped up.

    The parser has been divided up into several packages, and the directory
    structure of the distribution has been cleaned up.

    It is possible to choose among several lexical analyzers. One of them
    is based on Alain Frisch's wlex patch, which reduces the size of
    executables if a UTF-8 parser is needed.

    The parser works under Cygwin.

    Of course, there are several bug fixes. Note that most bugs were
    introduced in the development cycle between 1.0 and 1.1; only very few
    problems have been detected in the 1.0 release. I hope that 1.1 has
    similar quality.

--------------------------

[1] see http://download.camlcity.org/download/pxp-1.2.1.tar.gz

[2] see http://projects.camlcity.org/projects/ocamlnet.html

[3] see http://projects.camlcity.org/projects/pxp.html

[4] see http://projects.camlcity.org/projects/dl/pxp-1.2.1/doc/manual/html/index.html

[5] see mailto:gerd@gerd-stolpmann.de

[6] see http://caml.inria.fr/

[7] see http://www.w3.org/TR/1998/REC-xml-19980210.html