
Intro_namespaces


This text explains how PXP deals with the optional namespace declarations in XML text.

Namespaces

PXP supports namespaces (but they have to be explicitly enabled). In order to simplify the handling of namespace-aware documents PXP applies a transformation to the document which is called "prefix normalization". This transformation ensures that every namespace prefix uniquely identifies a namespace throughout the whole document.

Namespace URIs and prefixes

A namespace is identified by a namespace URI (e.g. something like "http://company.org/namespaces/project1"; note that this URI is simply processed as a string, and is never looked up by an HTTP access). To keep names short, one defines a so-called namespace prefix for such a URI. For example:

 <x:q xmlns:x="http://company.org/namespaces/project1">...</x:q>

The "xmlns:x" attribute is special: it declares that within this subtree the prefix "x" stands for the long URI. Here, "x:q" denotes the element "q" in the namespace bound to "x".

The problem is now that the URI defines the namespace, and not the prefix. In another subtree you may want to use the prefix "y" for the same namespace. This has always made it difficult to deal with namespaces in XML-processing software.

PXP, however, performs prefix normalization before it returns the tree. This means that all prefixes are changed to a norm prefix for the namespace. This can be the first prefix used for the namespace, or a prefix declared with a PXP extension, or a programmatically declared binding of the norm prefix to the namespace.

In order to use the PXP implementation of namespaces, one has to set enable_namespace_processing in the parser configuration, and to use namespace-aware node implementations. If you don't use extended node trees, this means using Pxp_tree_parser.default_namespace_spec instead of Pxp_tree_parser.default_spec. A good starting point that enables all of this:

  let nsmng = Pxp_dtd.create_namespace_manager()
  let config = 
        { Pxp_types.default_config with
             enable_namespace_processing = Some nsmng
        }
  let source = ...
  let spec = Pxp_tree_parser.default_namespace_spec
  let doc = Pxp_tree_parser.parse_document_entity config source spec
  let root = doc#root

The namespace-aware implementations of the node class type define additional namespace methods like namespace_uri (see Pxp_document.node.namespace_uri). (Although you also could direct the parser to create non-namespace-aware nodes, this does not make much sense, as you do not get these special access methods then.)

The method namespace_scope (see Pxp_document.node.namespace_scope) allows one to get more information about what happened during prefix normalization. In particular, it is possible to find out the original prefix in the XML text (which is also called the display prefix), before it was mapped to the normalized prefix. The namespace_scope method returns a Pxp_dtd.namespace_scope object with additional lookup methods.
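As a small sketch of these access methods, the following function walks a namespace-aware tree and prints, for every element, the normalized name together with the namespace URI, the local name, and the display prefix. The function name print_names is made up for this illustration and is not part of PXP; the sketch assumes that the tree was parsed with namespace processing enabled and with namespace-aware nodes, as in the configuration shown earlier.

```ocaml
(* Sketch only: requires the PXP library. print_names is a hypothetical
   helper for illustration, not a PXP function. *)
let rec print_names (n : _ Pxp_document.node) : unit =
  (match n # node_type with
     | Pxp_document.T_element name ->
         (* name already carries the normalized prefix *)
         Printf.printf "%s: uri=%s localname=%s display_prefix=%s\n"
           name (n # namespace_uri) (n # localname) (n # display_prefix)
     | _ ->
         (* data nodes, comments etc. have no namespace information *)
         ());
  List.iter print_names (n # sub_nodes)
```

Calling print_names doc#root then lists all elements of the document in document order.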

Example for prefix normalization

In the following XML snippet the prefix "h" is declared as a shorthand for the XHTML namespace:

<h:html xmlns:h="http://www.w3.org/1999/xhtml">
  <h:head>
    <h:title>Virtual Library</h:title> 
  </h:head> 
  <h:body> 
    <h:p>Moved to <h:a href="http://vlib.org/">vlib.org</h:a>.</h:p> 
  </h:body> 
</h:html>

In this example, normalization changes nothing, because the prefix "h" has the same meaning throughout the whole document. However, keep in mind that every author of XHTML documents can freely choose the prefix to use.

The XML standard even gives the author of the document the freedom to change the meaning of a prefix at any time. For example, here the prefix "x" is changed in the inner node:

<x:address xmlns:x="http://addresses.org">
  <x:name xmlns:x="http://names.org">
    Gerd Stolpmann
  </x:name>
</x:address>

In the outer node the prefix "x" is connected with the "http://addresses.org" namespace, but in the inner node it is connected with "http://names.org".

After normalization, the prefixes would look as follows:

<x:address xmlns:x="http://addresses.org">
  <x1:name xmlns:x1="http://names.org">
    Gerd Stolpmann
  </x1:name>
</x:address>

In order to avoid overridden prefixes, the prefix in the inner node was changed to "x1" (for type theorists: think of alpha conversion).
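The renaming step can be illustrated with a toy model that does not use PXP at all. It is only a sketch of the idea, not PXP's actual implementation: the type manager and the function normprefix are invented for this example. Every namespace URI gets exactly one norm prefix; the prefix found in the text is taken if it is still free, and otherwise a number is appended until the result is unused.

```ocaml
(* Toy model of prefix normalization; this is NOT PXP's implementation. *)

(* The state maps each namespace URI to its norm prefix. *)
type manager = { mutable bindings : (string * string) list }  (* uri, prefix *)

let create () = { bindings = [] }

let prefix_in_use m p = List.exists (fun (_, q) -> q = p) m.bindings

(* Return the norm prefix for [uri], registering the URI on first sight.
   [display_prefix] is only a suggestion: if another URI already owns it,
   a numeric suffix is appended until the prefix is free. *)
let normprefix m ~display_prefix ~uri =
  match List.assoc_opt uri m.bindings with
  | Some p -> p
  | None ->
      let rec pick i =
        let candidate =
          if i = 0 then display_prefix
          else display_prefix ^ string_of_int i in
        if prefix_in_use m candidate then pick (i + 1) else candidate in
      let p = pick 0 in
      m.bindings <- (uri, p) :: m.bindings;
      p

let () =
  let m = create () in
  let outer = normprefix m ~display_prefix:"x" ~uri:"http://addresses.org" in
  let inner = normprefix m ~display_prefix:"x" ~uri:"http://names.org" in
  (* The same URI under a different display prefix keeps its norm prefix. *)
  let again = normprefix m ~display_prefix:"y" ~uri:"http://addresses.org" in
  Printf.printf "%s %s %s\n" outer inner again   (* prints "x x1 x" *)
```

The last call shows the other half of the idea: a second display prefix for an already known URI does not create a new norm prefix.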

The idea of prefix normalization is to simplify how programs can match against element and attribute names. It is possible to configure the normalizer so that certain prefixes are used for certain URIs. In this example, we could direct the normalizer to use the prefixes "addr" and "nm" instead of the quite arbitrary strings "x" and "x1":

dtd # namespace_manager # add_namespace "addr" "http://addresses.org";
dtd # namespace_manager # add_namespace "nm" "http://names.org";

For this to work you need access to the dtd object before the parser actually starts its work. The parsing functions in Pxp_tree_parser have the special hook transform_dtd that is called at the right moment, and allows the program to enter such special configurations into the DTD object. The resulting program could then look like:

  let nsmng = Pxp_dtd.create_namespace_manager()
  let config = 
        { Pxp_types.default_config with
             enable_namespace_processing = Some nsmng
        }
  let source = ...
  let spec = Pxp_tree_parser.default_namespace_spec
  let transform_dtd dtd =
    dtd # namespace_manager # add_namespace "addr" "http://addresses.org";
    dtd # namespace_manager # add_namespace "nm" "http://names.org";
    dtd
  let doc = 
     Pxp_tree_parser.parse_document_entity ~transform_dtd config source spec
  let root = doc#root

Alternatively, it is also possible to put special processing instructions into the DTD:

<?pxp:dtd namespace prefix="addr" uri="http://addresses.org"?>
<?pxp:dtd namespace prefix="nm" uri="http://names.org"?>

The advantage of configuring specific normprefixes is that one can now use them directly in programs, e.g. for matching:

  match node#node_type with
    | T_element "addr:address" -> ...
    | T_element "nm:name" -> ...
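Putting the pieces together, a complete program might look like the following sketch. It makes a few assumptions that are not part of the text above: the XML text is read from a string with Pxp_types.from_string, the document is parsed in well-formedness mode with Pxp_tree_parser.parse_wfdocument_entity (the example text has no DOCTYPE, so it could not be validated), and the function describe is a made-up helper.

```ocaml
(* Sketch only: requires the PXP library (e.g. the findlib package pxp). *)
open Pxp_document

(* The address example; the backslashes join the lines without whitespace. *)
let xml_text =
  "<x:address xmlns:x=\"http://addresses.org\">\
     <x:name xmlns:x=\"http://names.org\">Gerd Stolpmann</x:name>\
   </x:address>"

let nsmng = Pxp_dtd.create_namespace_manager ()
let config =
  { Pxp_types.default_config with
      enable_namespace_processing = Some nsmng
  }

(* Declare the norm prefixes before the parser sees the first element. *)
let transform_dtd dtd =
  dtd # namespace_manager # add_namespace "addr" "http://addresses.org";
  dtd # namespace_manager # add_namespace "nm" "http://names.org";
  dtd

(* Assumption: well-formedness mode, because the text has no DOCTYPE. *)
let doc =
  Pxp_tree_parser.parse_wfdocument_entity
    ~transform_dtd config (Pxp_types.from_string xml_text)
    Pxp_tree_parser.default_namespace_spec

(* Hypothetical helper: match on names carrying the norm prefixes. *)
let describe node =
  match node # node_type with
    | T_element "addr:address" -> "address element"
    | T_element "nm:name" -> "name element"
    | T_element other -> "other element " ^ other
    | _ -> "not an element"

let () = print_endline (describe (doc # root))
```

The match works although the XML text uses the prefix "x" for both namespaces, because the tree only contains the normalized names.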

Getting more details of namespaces

There are two additional objects that are relevant. First, there is a namespace manager for the whole tree. This object collects all namespace URIs that occur in the XML text, and decides which normprefixes are associated with them: Pxp_dtd.namespace_manager.

Second, there is the namespace scope. An XML tree may have a lot of such objects. A new scope object is created whenever new namespaces are introduced, i.e. when there are "xmlns" declarations. The scope object has a pointer to the scope object for the surrounding XML text. Scope objects are documented here: Pxp_dtd.namespace_scope.

Some examples (when n is a node):

  • To find out which normprefix is used for a namespace URI, use
     n # namespace_manager # get_normprefix uri 
  • To find out the reverse, i.e. which URI is represented by a certain normprefix, use
     n # namespace_manager # get_primary_uri prefix 
  • To find out which namespace URI is meant by a display prefix, i.e. the prefix as it occurs literally in the XML text:
     n # namespace_scope # uri_of_display_prefix prefix 
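The two manager lookups can also be tried without parsing anything, because a namespace manager is an ordinary object. The following sketch creates a manager, declares the two namespaces of the address example, and queries it in both directions; the variable name mng is chosen for this illustration only.

```ocaml
(* Sketch only: requires the PXP library. *)
let () =
  let mng = Pxp_dtd.create_namespace_manager () in
  mng # add_namespace "addr" "http://addresses.org";
  mng # add_namespace "nm" "http://names.org";
  (* URI -> normprefix *)
  print_endline (mng # get_normprefix "http://names.org");
  (* normprefix -> URI *)
  print_endline (mng # get_primary_uri "addr")
```

The scope lookup uri_of_display_prefix has no such standalone form here, as scope objects belong to the nodes of a parsed tree.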

This web site is published by Informatikbüro Gerd Stolpmann