4.4. Invoking the parser

This section describes the module Pxp_yacc.

4.4.1. Defaults

The following defaults are available:

val default_config : config
val default_namespace_config : config
val default_extension : ('a node extension) as 'a
val default_spec : ('a node extension as 'a) spec
val default_namespace_spec : ('a node extension as 'a) spec

4.4.2. Parsing functions

In the following, the term "closed document" refers to an XML structure like

<!DOCTYPE ... [ declarations ] >
<root>
...
</root>
The term "fragment" refers to an XML structure like
<root>
...
</root>
i.e. only to one isolated element instance.

val parse_dtd_entity : config -> source -> dtd
Parses the declarations contained in the entity and returns them as a dtd object.

val extract_dtd_from_document_entity : config -> source -> dtd
Extracts the DTD from a closed document. Both the internal and the external subsets are extracted and combined into one dtd object. This function does not parse the whole document, but only the parts that are necessary to extract the DTD.
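
Both functions can be called like this (a minimal sketch; it assumes that Pxp_yacc has been opened, the file names are invented, and from_file is one of the source constructors):

(* Parse an external DTD file into a dtd object: *)
let dtd1 = parse_dtd_entity default_config (from_file "sample.dtd")

(* Extract the internal and external subsets from a document: *)
let dtd2 = extract_dtd_from_document_entity default_config (from_file "doc.xml")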

val parse_document_entity : 
    ?transform_dtd:(dtd -> dtd) ->
    ?id_index:('ext index) ->
    config -> 
    source -> 
    'ext spec -> 
        'ext document
Parses a closed document and validates it against the DTD that is contained in the document (internal and external subsets). The optional argument ~transform_dtd can be used to transform the DTD found in the document; the transformed DTD is then used for validation. If ~id_index is specified, an index of all ID attributes is built while parsing.
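
For example, a validating parse might look like this (a sketch; it assumes open Pxp_yacc and an invented file name; errors are reported as exceptions from Pxp_types):

let () =
  let doc =
    parse_document_entity default_config (from_file "doc.xml") default_spec in
  let root = doc # root in
  print_endline (root # data)
The data method returns the concatenated character data of the root element and its descendants.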

val parse_wfdocument_entity : 
    config -> 
    source -> 
    'ext spec -> 
        'ext document
Parses a closed document, but only checks it for well-formedness.
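
For example (a sketch; assumes open Pxp_yacc; from_string is another source constructor):

let doc =
  parse_wfdocument_entity default_config
    (from_string "<root>Hello world</root>") default_spec
No DTD-based validation takes place here; the input is only checked for well-formedness.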

val parse_content_entity  : 
    ?id_index:('ext index) ->
    config ->  
    source -> 
    dtd -> 
    'ext spec -> 
        'ext node
Parses a fragment and validates the element against the given DTD.
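
For example, a fragment can be validated against a separately parsed DTD (a sketch with invented file names; assumes open Pxp_yacc):

let frag =
  let dtd = parse_dtd_entity default_config (from_file "sample.dtd") in
  parse_content_entity default_config (from_file "fragment.xml") dtd default_spec
The root element of the fragment must be declared in the passed DTD for validation to succeed.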

val parse_wfcontent_entity : 
    config -> 
    source -> 
    'ext spec -> 
        'ext node
Parses a fragment, but only checks it for well-formedness.
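
For example (a sketch; assumes open Pxp_yacc):

let frag =
  parse_wfcontent_entity default_config
    (from_string "<item>some text</item>") default_spec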

4.4.3. Configuration options

type config =
    { warner : collect_warnings;
      enable_pinstr_nodes : bool;
      enable_super_root_node : bool;
      enable_comment_nodes : bool;
      drop_ignorable_whitespace : bool;
      encoding : rep_encoding;
      recognize_standalone_declaration : bool;
      store_element_positions : bool;
      idref_pass : bool;
      validate_by_dfa : bool;
      accept_only_deterministic_models : bool;
      enable_namespace_processing : Pxp_dtd.namespace_manager option;
      ...
    }

  • warner: The parser prints warnings by invoking the method warn of this warner object. (Default: all warnings are dropped)

  • enable_pinstr_nodes: If true, the parser creates extra nodes for processing instructions. If false, processing instructions are simply added to the element or document surrounding the instructions. (Default: false)

  • enable_super_root_node: If true, the parser creates an extra node which is the parent of the root of the document tree. This node is called the super root; it is an element with type T_super_root. If there are processing instructions outside the root element and outside the DTD, they are added to the super root instead of the document. If false, the super root node is not created. (Default: false)

  • enable_comment_nodes: If true, the parser creates nodes for comments with type T_comment; if false, such nodes are not created. (Default: false)

  • drop_ignorable_whitespace: If true, whitespace occurring in elements that are declared not to contain character data is dropped. (Default: true)

  • encoding: Specifies the internal encoding of the parser. Most strings are then represented according to this encoding; however, there are some exceptions (especially ext_id values, which are always UTF-8 encoded). (Default: `Enc_iso88591)

  • recognize_standalone_declaration: If true and if the parser is validating, the declaration standalone="yes" forces a check whether the document really is a standalone document. If false, or if the parser is in well-formedness mode, such declarations are ignored. (Default: true)

  • store_element_positions: If true, for every non-data node the source position is stored. If false, the position information is lost. If available, you can get the positions of nodes by invoking the position method (see the sketch after this list). (Default: true)

  • idref_pass: If true and if there is an ID index, the parser checks whether every IDREF or IDREFS attribute refers to an existing node; this requires that the parser traverses the whole document tree. If false, this check is left out. (Default: false)

  • validate_by_dfa: If true and if the content model for an element type is deterministic, a deterministic finite automaton is used to validate whether the element contents match the content model of the type. If false, or if a DFA is not available, a backtracking algorithm is used for validation. (Default: true)

  • accept_only_deterministic_models: If true, only deterministic content models are accepted; if false, all syntactically correct content models can be processed. (Default: true)

  • enable_namespace_processing: Setting this to Some (new namespace_manager) enables namespace processing. The DTD will be initialized with the passed namespace manager. (Default: None for default_config, Some m for default_namespace_config)
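
As a small illustration of store_element_positions, positions can be read back with the position method (only a sketch; it assumes open Pxp_yacc, an invented file name, and that position returns the entity name, line, and column):

let () =
  let doc =
    parse_wfdocument_entity default_config (from_file "doc.xml") default_spec in
  let (entity, line, col) = doc # root # position in
  Printf.printf "Root element at %s, line %d, column %d\n" entity line col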

4.4.4. Which configuration should I use?

First, I recommend varying the default configuration instead of creating a new configuration record. For instance, to set idref_pass to true, change the default as in:

let config = { default_config with idref_pass = true }
The background is that I can add more options to the record in future versions of the parser without breaking your programs.

To start with namespace processing, use default_namespace_config and default_namespace_spec.

Do I need extra nodes for processing instructions? By default, such nodes are not created. This does not mean that the processing instructions are lost; however, you cannot find out the exact location where they occur. For example, the following XML text

<x><?pi1?><y/><?pi2?></x> 
will normally create one element node for x containing one subnode for y. The processing instructions are attached to x in a separate hash table; you can access them using x # pinstr "pi1" and x # pinstr "pi2", respectively. The information about where the instructions occur within x is lost.

If the option enable_pinstr_nodes is turned on, the parser creates extra nodes pi1 and pi2 such that the subnodes of x are now:

x # sub_nodes = [ pi1; y; pi2 ]
The extra nodes contain the processing instructions in the usual way, i.e. you can access them using pi1 # pinstr "pi1" and pi2 # pinstr "pi2", respectively.

Note that you will need an exemplar for the PI nodes (see make_spec_from_alist).
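
A corresponding configuration and spec could be set up roughly as follows (only a sketch: the exemplar classes and the labelled arguments of make_spec_from_alist are quoted from memory and should be checked against Pxp_document; depending on the PXP version a dedicated PI exemplar class may be required instead of element_impl):

let pi_spec =
  Pxp_document.make_spec_from_alist
    ~default_pinstr_exemplar:(new Pxp_document.element_impl default_extension)
    ~data_exemplar:(new Pxp_document.data_impl default_extension)
    ~default_element_exemplar:(new Pxp_document.element_impl default_extension)
    ~element_alist:[]
    ()

let config = { default_config with enable_pinstr_nodes = true }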

Do I need a super root node? By default, there is no super root node. The document object refers directly to the node representing the root element of the document, i.e.

doc # root = r
if r is the root node. This is sometimes inconvenient: (1) Some algorithms become simpler if every node has a parent, even the root node. (2) Some standards such as XPath use the term "root node" for the node whose child is the root element of the document. (3) The super root node can serve as a container for processing instructions outside the root element. For these reasons, it is possible to create an extra super root node whose child is the root node:
doc # root = sr         &&
sr # sub_nodes = [ r ]
When extra nodes are also created for processing instructions, these nodes can be added to the super root node if they occur outside the root element (reason (3)), and the order reflects the order in the source text.

Note that you will need an exemplar for the super root node (see make_spec_from_alist).
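
Again a rough sketch (as above, the labels of make_spec_from_alist and the exemplar classes are from memory; check Pxp_document for the exact interface):

let sr_spec =
  Pxp_document.make_spec_from_alist
    ~super_root_exemplar:(new Pxp_document.element_impl default_extension)
    ~data_exemplar:(new Pxp_document.data_impl default_extension)
    ~default_element_exemplar:(new Pxp_document.element_impl default_extension)
    ~element_alist:[]
    ()

let config = { default_config with enable_super_root_node = true }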

What is the effect of the UTF-8 encoding? By default, the parser represents strings (with few exceptions) as ISO-8859-1 strings. These are well-known, and there are tools and fonts for this encoding.

However, internationalization may require that you switch over to UTF-8 encoding. In most environments, the immediate effect will be that you cannot read strings with character codes >= 160 any longer; your terminal will only show funny glyph combinations. It is strongly recommended to install Unicode fonts (GNU Unifont, Markus Kuhn's fonts) and a terminal emulator that can handle UTF-8 byte sequences. Furthermore, a Unicode editor may be helpful (such as Yudit). There is also a UTF-8 FAQ by Markus Kuhn.

By setting encoding to `Enc_utf8, all strings originating from the parsed XML document are represented as UTF-8 strings. This includes not only character data and attribute values but also element names, attribute names, and so on, as it is possible to use any Unicode letter to form such names. Strictly speaking, PXP is only XML-compliant if the UTF-8 mode is used; otherwise it will have difficulties when validating documents containing non-ISO-8859-1 names.

This mode does not have any impact on the external representation of documents. The character set assumed when reading a document is set in the XML declaration, and the character set used when writing a document must be passed to the write method.
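
Switching the internal representation is then just a matter of the configuration (a sketch):

let config = { default_config with encoding = `Enc_utf8 }
(* All strings delivered by the parser are now UTF-8 encoded. *)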

How do I check that the nodes referred to by IDREF attributes exist? First, you must create an index of all occurring ID attributes:

let index = new hash_index
This index must be passed to the parsing function:
parse_document_entity
  ~id_index:(index :> index)
  config source spec
Next, you must turn on the idref_pass mode:
let config = { default_config with idref_pass = true }
Note that now the whole document tree will be traversed, and every node will be checked for IDREF and IDREFS attributes. If the tree is big, this may take some time.
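
Putting these pieces together (a sketch; it assumes open Pxp_yacc and an invented file name):

let () =
  let index = new hash_index in
  let config = { default_config with idref_pass = true } in
  let doc =
    parse_document_entity
      ~id_index:(index :> _ index)    (* coerce hash_index to the index class type *)
      config (from_file "doc.xml") default_spec in
  (* Dangling IDREF/IDREFS values have been reported as validation errors;
     the ID index itself can also be queried afterwards. *)
  ignore doc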

What are deterministic content models? This type of model can speed up validation checks; furthermore, it ensures SGML compatibility. A content model is deterministic if the parser can determine the alternative that actually applies by inspecting only the current token. For example, this element has a non-deterministic content model:

<!ELEMENT x ((u,v) | (u,y+) | v)>
If the first element in x is u, the parser does not know which of the alternatives (u,v) or (u,y+) will work; the parser must also inspect the second element to be able to distinguish between the alternatives. Because such look-ahead (or "guessing") is required, this example is non-deterministic.
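
The same language can, however, be described by a deterministic model if the common prefix u is factored out:

<!ELEMENT x ((u, (v | y+)) | v)>
Now the first child (u or v) selects the outer alternative, and after u the next child (v or y) selects the inner one.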

The XML standard demands that content models be deterministic, so it is recommended to turn the option accept_only_deterministic_models on; however, PXP can also process non-deterministic models using a backtracking algorithm.

Deterministic models ensure that validation can be performed in linear time. In order to get the maximum benefits, PXP also implements a special validator that profits from deterministic models; this is the deterministic finite automaton (DFA). This validator is enabled per element type if the element type has a deterministic model and if the option validate_by_dfa is turned on.

In general, I expect the DFA method to be faster than the backtracking method; in particular, in the worst case the DFA still takes only linear time. However, if the content model has only few alternatives and the alternatives do not nest, the backtracking algorithm may be better.
