
Intro_events



XML data as stream of events

In contrast to the tree mode (see Intro_trees), in event mode the parser does not return the complete document at once, but as a sequence of so-called events. The parser makes a number of guarantees about the structure of the emitted events; in particular, it is ensured that they conform to the well-formedness constraints. For instance, it is ensured that start tags and end tags are properly nested. Nevertheless, it is up to the caller to process and/or aggregate the events. This leaves a lot of freedom to the caller.

The event mode is especially well-suited for processing very large documents. As PXP does not by itself represent the complete document in memory, it usually does not need to maintain large data structures in event mode. Of course, the caller should also try to avoid such data structures. In many cases this makes it possible to process even arbitrarily large documents. Note, however, that not all limits are removed. For example, for checking well-formedness the parser still needs to maintain a stack of the start tags whose end tags have not been seen yet. Because of this, it is not possible to parse arbitrarily deeply nested documents with constant memory. On 32 bit platforms, there is still a limit of 16 MB on the maximum string length.

Another application of event mode is the direct combination with recursive-descent parsers for postprocessing the stream of events. See below Connect PXP with a recursive-descent parser for more.

The event mode also makes it feasible to enable the special escape tokens {, }, {{, and }}. PXP can be configured such that these tokens trigger a user-defined add-on parser that reads directly from the character stream. See below Escape PXP parsing for more.

We should also mention one basic limitation of event-oriented parsing: It is fundamentally incompatible with validation, as the tree view is required to validate a document.


Compatibility

Event mode is compatible with:

  • Well-formedness parsing
  • Namespaces: Namespace processing works as outlined in Intro_namespaces, except that the user needs to interpret the namespace information contained in the events differently. See Events and namespaces below for more.
  • Reading from arbitrary sources as described in Intro_resolution
Event mode is incompatible with:

  • Validation

The structure of event streams

First we describe how well-formed XML fragments are represented in stream format, i.e. XML text that is properly nested with respect to start tags and end tags. For real input, the parser will also emit some wrapping. A distinction is made between documents and non-document entities. A document is a formally closed text that consists of one main entity (file) and optionally a number of referenced entities. One can parse a file as a document, and in this case the parser will add a wrapping suited for documents. Alternatively, one can parse an entity as a plain entity, and in this case the parser will add a wrapping suited for non-documents. Note that the XML declaration (<?xml ... ?>) for such non-document entities is slightly different, and that no DOCTYPE clause is permitted.

The structure of well-formed XML fragments

The type of events is Pxp_types.event. The events do not strictly correspond to syntactical elements of XML, but more to a logical interpretation.

The parser emits events for

  • E_char_data(text): Character data - The parser emits character data events for sequences of characters. It is unspecified how long these sequences are. This means it is up to the parser how a contiguous section of characters is split up into one or more character data events, i.e. adjacent character data events may be emitted by the parser. Also, no attempt is made to suppress whitespace of any kind. For example, the XML text
      Hello world

    might lead to the emission of

      [ E_char_data "Hello "; E_char_data "world" ]

    but also to any other split into events.
    

  • E_start_tag(name,atts,scope_opt,entid): Start tags of elements - Includes everything within the angle brackets, i.e. name and attribute list atts (as name/value pairs). The event also includes the namespace scope scope_opt if namespace processing is enabled (None otherwise), and a reference entid to the entity the tag occurs in. Note that the tag name and the attribute names are subject to prefix normalization if namespace processing is enabled.

  • E_end_tag(name,entid): End tags of elements - The event mentions the name, and the entity entid the tag occurs in. Both name and entid are always identical to the values attached to the corresponding start tag.

    Note that the short form of empty elements, <tag/>, is emitted as a start tag immediately followed by an end tag.

  • E_pinstr(name,value,entid): Processing instructions (PI's) - In tree mode, PI's can be represented in two ways: Either by attaching them to the surrounding elements, or by including them into the tree exactly where they occurred in the text. For symmetry, the same two ways of handling PI's are also available in the event stream representation (event streams and trees should be convertible into each other without data loss). Although there is only one event (E_pinstr), it depends on the config option enable_pinstr_nodes where this event is placed into the event stream. If the option is enabled, E_pinstr is always emitted where the PI occurs in the XML text. If it is disabled, the emission of E_pinstr may be delayed, but it is still guaranteed that this happens in the same context (surrounding element). It is not possible to turn the emission of PI events off completely. (See Filters for an example of how to filter out PI events in a postprocessing step.)

  • E_comment text: Comments - If enabled (by enable_comment_nodes in Pxp_types.config), the parser emits comment events.

  • E_start_super and E_end_super: Super root nodes - If enabled (by enable_super_root_node in Pxp_types.config), the parser emits a start event for the super root node at the beginning of the stream, and an end event at the end of the stream. This is comparable to an element embracing the whole text.

  • E_position(e,l,p): Position events - If enabled (by store_element_positions in Pxp_types.config), the parser emits special position events. These events refer to the immediately following event, and say where in the XML text the following event originates. Position events are emitted before E_start_tag, E_pinstr, and E_comment. The argument e is a textual description of the entity, l is the line, and p is the byte position of the character. (A small sketch of how position events can be paired with the following event is given after the example below.)

As in the tree mode, entities are fully resolved, and do not appear in the parsed events. Also, syntactic elements like CDATA sections, the XML declaration, the DOCTYPE clause, and all elements only allowed in the DTD part are not represented.

Example for an event stream: The XML fragment

  <p a1="one"><q>data1</q><r>data2</r><s></s><t/></p>

could be represented as

  [ E_start_tag("p",["a1","one"],None,<entid>);
    E_start_tag("q",[],None,<entid>);
    E_char_data "data1";
    E_end_tag("q",<entid>);
    E_start_tag("r",[],None,<entid>);
    E_char_data "data2";
    E_end_tag("r",<entid>);
    E_start_tag("s",[],None,<entid>);
    E_end_tag("s",<entid>);
    E_start_tag("t",[],None,<entid>);
    E_end_tag("t",<entid>);
    E_end_tag("p",<entid>);
  ]

where <entid> is the entity ID object.
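Position events, if enabled, interleave with such a stream and always refer to the next event. The following sketch (an illustration, not taken from the PXP manual) shows one way to pair them with the start tags while pulling events; it assumes a pull function of type unit -> Pxp_types.event option, as introduced in the section Calling the parser in event mode below:

  (* Remember the last E_position event and report it with the next start tag.
     "pull" is assumed to be a pull function: unit -> Pxp_types.event option. *)
  let report_start_tag_positions pull =
    let last_pos = ref None in
    let rec loop () =
      match pull () with
        | Some (Pxp_types.E_position (e, line, byte)) ->
            last_pos := Some (e, line, byte);
            loop ()
        | Some (Pxp_types.E_start_tag (name, _, _, _)) ->
            ( match !last_pos with
                | Some (e, line, byte) ->
                    Printf.printf "<%s> at %s, line %d, byte %d\n" name e line byte
                | None ->
                    Printf.printf "<%s> (position unknown)\n" name
            );
            last_pos := None;
            loop ()
        | Some _ ->
            last_pos := None;
            loop ()
        | None ->
            ()
    in
    loop ()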

The wrapping for non-document entities

The XML specification demands that external XML entities (that are referenced from a document entity or from another external entity) comply with this grammar (excerpt from the W3C definition):

extParsedEnt ::= TextDecl? content
TextDecl     ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
content      ::= (element | CharData | Reference | CDSect | PI | Comment)*

i.e. there can be an XML declaration at the beginning (always with an encoding declaration), but the declaration is optional. It is followed by a sequence of elements, character data, processing instructions and comments (which are reflected by the events emitted by the parser), and by entity references and CDATA sections (which are already resolved by the parser).

The emitted events are now:

  • No event is emitted for the XML declaration
  • The stream consists of the events for the content production
  • Finally, there is an E_end_of_stream event.
When the parser detects an error, it stops the event stream and emits a final E_error event instead.

The wrapping for closed documents

Closed documents have to match this grammar (excerpt from the W3C definition):

document ::= prolog element Misc*
prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

That means there can be an XML declaration at the beginning (always with a VersionInfo declaration), but the declaration is optional. There can be a DOCTYPE declaration. Finally, there must be a single element. The production Misc stands for a comment, a processing instruction, or whitespace.

The emitted events are now:

  • E_start_doc(version,dtd) is always emitted at the beginning. The version string is from VersionInfo, or "1.0" if the whole XML declaration is missing. The dtd object may contain the declaration of the parsed DOCTYPE clause. However, by setting parsing parameters it is possible to control which declarations are added to the dtd object.
  • If enable_super_root_node is set: E_start_super
  • If there are comments or processing instructions before the topmost element, and the node type is enabled, these events are now emitted.
  • Now the events of the topmost element follow.
  • If there are comments or processing instructions after the topmost element, and the node type is enabled, these events are now emitted.
  • If enable_super_root_node is set: E_end_super
  • E_end_doc name: ends the document. The name is the literal name of the topmost element, without any prefix normalization even if namespace processing is enabled.
  • Finally, there is an E_end_of_stream event.
When the parser detects an error, it stops the event stream and emits a final E_error event instead.
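As an illustration (assuming super root nodes and position events are disabled, and using the notation of the earlier example), parsing the document

  <?xml version="1.0"?><x>text</x>

with `Entry_document could result in the stream

  [ E_start_doc("1.0", <dtd>);
    E_start_tag("x", [], None, <entid>);
    E_char_data "text";
    E_end_tag("x", <entid>);
    E_end_doc "x";
    E_end_of_stream;
  ]

where <dtd> is the DTD object passed in E_start_doc.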

Calling the parser in event mode

The parser returns the emitted events while it is parsing. There are two models for that:

  • Push parsing: The caller passes a callback function to the parser, and whenever the parser emits an event, this function is invoked
  • Pull parsing: The parser runs as a coroutine together with the caller. The invocation of the parser returns the pull function. The caller now repeatedly invokes the pull function to get the emitted events until the end of the stream is indicated.
Let's look at both models in detail by giving an example. There is some code that is needed in both push and pull parsing. This example is similar to the examples given in Intro_getting_started. First we need a Pxp_types.source that says where the input to parse comes from. Second, we need an entity manager (of the opaque PXP type Pxp_entity_manager.entity_manager). The entity manager is a device that controls the source and switches between the entities to parse (if such switches are necessary). The entity manager is visible to the caller in event mode - in tree mode it is also needed, but hidden in the parser driver.

let config = Pxp_types.default_config
let source = Pxp_types.from_file "filename.xml"
let entmng = Pxp_ev_parser.create_entity_manager config source

(See also: Pxp_ev_parser.create_entity_manager.)

From here on, the required code differs in both parsing modes.

Push parsing

The function Pxp_ev_parser.process_entity invokes the parser in push mode:

let () = Pxp_ev_parser.process_entity config entry entmng (fun ev -> ...)

The callback function is here shown as (fun ev -> ...). It is called back for every emitted event ev (of type Pxp_types.event). It is ensured that the last emitted event is either E_end_of_stream or E_error. See the documentation of Pxp_ev_parser.process_entity for details about error handling.

The parameter entry (of type Pxp_types.entry) determines the entry point in the XML grammar. Essentially, it says what kind of thing to parse. Most users will want to pass `Entry_document here to parse a closed document. Note that the emitted event stream includes the wrapping for documents as described in The wrapping for closed documents.

The entry point `Entry_content is for non-document external entities, as described in The wrapping for non-document entities. There is a similar entry point, `Entry_element_content, which additionally enforces some constraints on the node structure. In particular, there must be a single top-level element so that the enforced node structure looks like a document. We do not recommend using `Entry_element_content - rather use `Entry_document, and remove the document wrapping in a postprocessing step.

The entry point `Entry_expr reads a single node (see Pxp_types.entry for details). It is recommended to use Pxp_ev_parser.process_expr instead of Pxp_ev_parser.process_entity together with this entry point, as this allows parsing to start and end within an entity, instead of having to parse an entity as a whole. (This is intended for special applications only.)

The entry point `Entry_declarations is currently unused.

Flags for `Entry_document. This entry point takes some flags as arguments that determine some details. It is usually ok to just pass the empty list of flags, i.e. `Entry_document []. The flags may enable some validation checks, or at least configure that some data is stored in the DTD object so that it is available for a later validation pass. Remember that the event mode by itself can only do well-formedness parsing. It can be reasonable, however, to enable flags when the event stream is later validated by some other means (e.g. by converting it into a tree and validating it).
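Putting the pieces together, a minimal push-mode sketch that just prints every event could look as follows (the file name is a placeholder):

  let () =
    let config = Pxp_types.default_config in
    let source = Pxp_types.from_file "filename.xml" in
    let entmng = Pxp_ev_parser.create_entity_manager config source in
    Pxp_ev_parser.process_entity
      config
      (`Entry_document [])
      entmng
      (fun ev -> print_endline (Pxp_event.string_of_event ev))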

Pull parsing

The pull parser is created by Pxp_ev_parser.create_pull_parser like:

let pull = create_pull_parser config entry entmng

The arguments config, entry, and entmng have the same meaning as for the push parser. In the case of the pull parser, however, no callback function is passed by the user. Instead, the return value pull is a function one can call to "pull" the events out of the parser engine. The pull function returns Some ev where ev is the event of type Pxp_types.event. After the end of the stream is reached, the function returns None.

Essentially, the parser works like an engine that can be started and stopped. When the pull function is invoked, the parser engine is "turned on", and runs for a while until (at least) the next event is available. Then, the engine is stopped again, and the event is returned. The engine keeps its state between invocations of pull so that the parser continues exactly at the point where it stopped the last time.

Note that files and other resources of the operating system are kept open while parsing is in progress. The user is expected to keep calling pull until the end of the stream is reached (at least until Some E_end_of_stream, Some E_error, or None is returned by pull). See the description of Pxp_ev_parser.close_entities for a way of prematurely closing the parser for the exceptional cases where parsing cannot go on until the final parser state is reached.
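For illustration, a minimal pull loop that prints every event could look like this (reusing config and entmng from the common setup above):

  let () =
    let pull = Pxp_ev_parser.create_pull_parser config (`Entry_document []) entmng in
    let rec loop () =
      match pull () with
        | Some ev -> print_endline (Pxp_event.string_of_event ev); loop ()
        | None -> ()
    in
    loop ()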

Preprocessor

The PXP preprocessor (see Intro_preprocessor) allows one to create event streams programmatically. One can get the events either as list (type Pxp_types.event list), or in a form compatible with pull parsing. For example,

let book_list = 
  <:pxp_evlist< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

returns the events as a Pxp_types.event list whereas

let pull_book = 
  <:pxp_evpull< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

defines pull_book as an automaton from which one can pull the events like from a pull parser, i.e. pull_book is of type unit->Pxp_types.event option, and by calling it one can get the events one after the other. pull_book has the same type as the pull function returned by the pull parser.

For a more complete discussion see Generating events: pxp_evlist and pxp_evpull.

Note that the preprocessor does not add any wrapping for documents or non-documents to the event stream. See Documents for an example of how to add such a wrapping in a user postprocessing step.

Push or pull?

The question arises whether one should prefer the push or the pull model. Generally, it is easy to turn a pull parser into a push parser by adding a loop that repeatedly invokes pull to get the events, and then calls the push function to deliver each event. There is no such possibility the other way round, i.e. one cannot take a push parser and make it look like a pull parser by wrapping it into some interface adapter - at least not in a language like O'Caml that does not have coroutines or continuations as language features. Effectively, the pull model is the more general one.

The function Pxp_event.iter can be used to turn a pull parser into a push parser:

Pxp_event.iter push pull

The events pull-ed out of the parser engine are delivered one by one to the receiver by invoking push.

In PXP, the pull model is preferred, and a number of helper functions are only available for the pull model. If you need a push-stream nevertheless, it is recommended to use the pull parser, and to do all required transformations on it (like filtering, see below). Finally use Pxp_event.iter to turn the pull stream into a push-compatible stream.

Filters

Filters are a way to transform event streams (as defined for pull parsers). For example, one can remove the processing instruction events as follows (given that pull is the original pull function, we define a modified pull' for the transformed stream):

let pull' = Pxp_event.pfilter
              (function
                | E_pinstr(_,_,_) -> false
                | _ -> true
              )
              pull

When events are read from pull', the events are also read from pull, but all processing instruction events are suppressed. Pxp_event.pfilter works a lot like List.filter - it only keeps the events in the stream for which a predicate function returns true.

Normalizing character data events

Pxp_event.norm_cdata_filter is a special predefined filter that transforms E_char_data events so that

  • empty E_char_data events are removed
  • adjacent E_char_data events are concatenated and replaced by a single E_char_data event
The filter is simply called by

let pull' = Pxp_event.norm_cdata_filter pull

Removing ignorable whitespace

In validation mode, the DTD may specify ignorable whitespace. This is whitespace that is known to exist only to make the XML text more readable (indentation etc.). In tree mode, ignorable whitespace is removed by default (see drop_ignorable_whitespace in Pxp_types.config).

It is possible to clean up the event stream in this way - although the event mode is not capable of doing a full validation of the XML document. It is required, however, that all declarations are added to the DTD object. This is done by setting the flags `Extend_dtd_fully or `Val_mode_dtd in the entry point, e.g. use

let entry = `Entry_document [`Extend_dtd_fully]

when you create the pull parser. The declarations of the XML elements are needed to check whether whitespace can be dropped.

The filter function is Pxp_event.drop_ignorable_whitespace_filter. Use it like

let pull' = Pxp_event.drop_ignorable_whitespace_filter pull

This filter does:

  • it checks whether non-whitespace is used in forbidden places, e.g. as children of an element that is declared with a regular expression content model
  • it removes E_char_data events consisting only of whitespace when they are ignorable.
The stream remains normalized if it was already normalized, i.e. you can use this filter before or after Pxp_event.norm_cdata_filter.

Unwrapping documents

Sometimes it is necessary to get rid of the document wrapping. The filter Pxp_event.unwrap_document can do this. Call it like:

let get_doc_details, pull' = Pxp_event.unwrap_document pull

The filter removes all E_start_doc, E_end_doc, E_start_super, E_end_super, and E_end_of_stream events. Also, when an E_error event is encountered, the attached exception is raised. The information attached to the removed E_start_doc event can be retrieved by calling get_doc_details:

let xml_version, dtd = get_doc_details()

Note that this call will fail if there is no E_start_doc, and it can fail if it is not at the expected position in the stream. If you parse with the entry `Entry_document, this cannot happen, though.

It is allowed to call get_doc_details before using pull'.
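As a sketch (with a placeholder file name), the following parses a document, strips the wrapping, and counts the start tags:

  let count_elements () =
    let config = Pxp_types.default_config in
    let source = Pxp_types.from_file "filename.xml" in
    let entmng = Pxp_ev_parser.create_entity_manager config source in
    let pull = Pxp_ev_parser.create_pull_parser config (`Entry_document []) entmng in
    let _details, pull' = Pxp_event.unwrap_document pull in
    let n = ref 0 in
    Pxp_event.iter
      (function
         | Pxp_types.E_start_tag(_,_,_,_) -> incr n
         | _ -> ())
      pull';
    !n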

Chaining filters

It is allowed to chain filters, e.g.

let pull1 = Pxp_event.drop_ignorable_whitespace_filter pull
let pull2 = Pxp_event.norm_cdata_filter pull1

Other helper functions

In Pxp_event there are also other helper functions besides filters. These functions can do:

  • conversion of pull streams to and from lists
  • concatenation of pull streams
  • extraction of nodes from pull streams
  • printing of pull streams
  • split namespace names
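For example, list conversion can be used to materialize a (small) stream and to replay it later. A sketch, assuming the conversion functions are named Pxp_event.to_list and Pxp_event.of_list (check the Pxp_event documentation for the exact names and signatures):

  (* Assumed API: to_list : (unit -> event option) -> event list
                  of_list : event list -> (unit -> event option) *)
  let roundtrip pull =
    let events = Pxp_event.to_list pull in   (* drain the pull stream into a list *)
    Pxp_event.of_list events                 (* turn the list back into a pull stream *)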

Events and namespaces

Namespace processing can also be enabled in event mode. This means that prefix normalization is applied to all names of elements and attributes. For example, this piece of code parses a file in event mode with enabled namespace processing:

  let nsmng = Pxp_dtd.create_namespace_manager()
  let config = 
        { Pxp_types.default_config with
             enable_namespace_processing = Some nsmng
        }
  let source = ...
  let entmng = Pxp_ev_parser.create_entity_manager config source
  let pull = create_pull_parser config entry entmng

The names returned in E_start_tag(name,attlist,scope_opt,entid) are prefix-normalized, i.e. name and the attribute names in attlist. The functions Pxp_event.namespace_split and Pxp_event.extract_prefix can be useful to analyze the names. For example, to get the namespace URI of an element name, do:

  match ev with
    | Pxp_types.E_start_tag(name,_,_,_) ->
        let prefix = Pxp_event.extract_prefix name in
        let uri = nsmng # get_primary_uri prefix in
        ...

Note that this may raise the exception Namespace_prefix_not_managed if the prefix is unknown or empty.

When namespace processing is enabled, the namespace scopes are included in the E_start_tag events. This can be used to get the display (original) prefix:

  match ev with
    | Pxp_types.E_start_tag(name,_,Some scope,_) ->
        let prefix = Pxp_event.extract_prefix name in
        let dsp_prefix = scope # display_prefix_of_normprefix prefix in
        ...

Note that this may raise the exception Namespace_prefix_not_managed if the prefix is unknown or empty, or Namespace_not_in_scope if the prefix is not declared for this part of the XML text.

Example: Print the events while parsing

The following piece of code parses an XML file in event mode, and prints the events. The reader is encouraged to modify the code by e.g. adding filters, to see the effect.

  let config = Pxp_types.default_config
  let source = Pxp_types.from_file "filename.xml"
  let entmng = Pxp_ev_parser.create_entity_manager config source
  let entry = `Entry_document []
  let pull = Pxp_ev_parser.create_pull_parser config entry entmng
  let () = Pxp_event.iter
             (fun ev -> print_endline (Pxp_event.string_of_event ev))
             pull

Connect PXP with a recursive-descent parser

We assume here that a list of integers like

   43 :: 44 :: []

is represented in XML as

  <list><cons><int>43</int><cons><int>44</int><nil/></cons></cons></list>

i.e. we have

  • list indicates that the single child is a list
  • cons has two children: the first is the head of the list, and the second the tail (think head :: tail in O'Caml)
  • nil is the empty list
  • int is an integer member of the list
We want to parse such XML texts by using the event-oriented parser, and combine it with a recursive-descent grammar. The XML parser delivers events which are taken as the tokens of the second parser.

open Pxp_types

let parse_list (s:string) =

  let rec parse_whole_list stream =
    (* Production:
         whole_list ::= "<list>" sub_list "</list>" END
     *)

    match stream with parser
        [< 'E_start_tag("list",_,_,_);
           l = parse_sub_list;
           'E_end_tag("list",_);
           'E_end_of_stream;
        >] ->
          l

  and parse_sub_list stream =
    (* Production:
         sub_list ::= "<cons>" object sub_list "</cons>"
                    | "<nil>" "</nil>"
     *)

    match stream with parser
        [< 'E_start_tag("cons",_,_,_); 
           head = parse_object;
           tail = parse_sub_list;
           'E_end_tag("cons",_)
        >] ->
          head :: tail
          
      | [< 'E_start_tag("nil",_,_,_); 'E_end_tag("nil",_) >] ->
          []

  and parse_object stream =
    (* Production:
         object ::= "<int>" text "</int>"
       with constraint that text is an integer parsable by int_of_string
     *)

    match stream with parser
        [< 'E_start_tag("int",_,_,_);
           number = parse_text;
           'E_end_tag("int",_)
        >] ->
          int_of_string number

  and parse_text stream =
    (* Production.
         text ::= "any XML character data"
     *)

    match stream with parser
        [< 'E_char_data data;
           rest = parse_text
        >] ->
          data ^ rest
      | [< >] ->
          ""
  in

  let config = 
    { Pxp_types.default_config with
        store_element_positions = false;
          (* don't produce E_position events *)
    }
  in
  let mgr = 
     Pxp_ev_parser.create_entity_manager
       config
       (Pxp_types.from_string s) in
  let pull = 
    Pxp_ev_parser.create_pull_parser config (`Entry_content[]) mgr in
  let pull' =
    Pxp_event.norm_cdata_filter pull in
  let next_event_or_error _n =
    (* Stream.from passes a counter which we ignore; pull' takes unit *)
    let e = pull' () in
    match e with
        Some(E_error exn) -> raise exn
      | _ -> e
  in
  let stream =
    Stream.from next_event_or_error in
  parse_whole_list stream

The trick is to use Stream.from to convert the "pull-style" event stream into a Stream.t. This kind of stream can be parsed in a recursive-descent way by using the stream parser syntax available through camlp4.

Note that we normalize the character data nodes. The grammar can only process a single E_char_data event, and this normalization enforces that adjacent E_char_data events are merged.

Note that you have to enable camlp4 when compiling this example, because the stream parsers are only available via camlp4.
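For example, applying parse_list to the XML text from the beginning of this section should yield the original list:

  let l = parse_list
            "<list><cons><int>43</int><cons><int>44</int><nil/></cons></cons></list>"
  (* l = [43; 44] *)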

Escape PXP parsing

This feature is still considered experimental!

It is possible to define two escaping functions in Pxp_types.config:

  • escape_contents: This function is called when one of the tokens {, }, {{, or }} is found in character data context.
  • escape_attributes: This function is called when one of these tokens is found in the value of an attribute.
Both escaping functions are allowed to operate directly on the underlying lexical buffer PXP uses, and because of this these functions can interpret the following characters in an arbitrary special way. The escaping functions have to return a replacement text, i.e. a string that is to be taken as character data or as attribute value (depending on context).

Why are the curly braces taken as escaping characters? This is motivated by the XQuery language. Here, a single { switches from the XML object language to the XQuery meta language until another } terminates this mode. By doubling the brace character, it loses its escaping function, and a single brace character is assumed.

A simple example makes this clearer. We allow here that a number is written between curly braces in hexadecimal, octal or binary notation using the conventions of O'Caml. The number is inserted into the event stream in normalized decimal notation (i.e. no leading zeros). For instance, one can write

  <foo x="{0xff}" y="{{}}">{0o76}</foo>

and the parser emits the events

    E_start_tag("foo", ["x""255""y""{}" ], _, _)
    E_char_data("62")
    E_end_tag("foo",_)

Of course, this example is very trivial, and in this case, one could also get the same effect by postprocessing the XML events. We want to point out, however, that the escaping feature makes it possible to combine PXP with a foreign language with its own lexing and parsing functions.

First, we need a lexer - this is lex.mll:

  rule scan_number = parse
   | [ '0'-'9' ]+ 
        { `Int (int_of_string (Lexing.lexeme lexbuf)) }
   | ("0b"|"0B") [ '0'-'1' ]+ 
        { `Int (int_of_string (Lexing.lexeme lexbuf)) }
   | ("0o"|"0O") [ '0'-'7' ]+ 
        { `Int (int_of_string (Lexing.lexeme lexbuf)) }
   | ("0x"|"0X") [ '0'-'9' 'a'-'f' 'A'-'F' ]+ 
        { `Int (int_of_string (Lexing.lexeme lexbuf)) }
   | "}" 
        { `End }
   | _
        { `Bad }
   | eof
        { `Eof }

This lexer parses the various forms of numbers. We are lucky that we can use int_of_string to convert these forms to ints. The right curly brace is also recognized. Any other character leads to a lexing error (`Bad). If the XML file stops, `Eof is emitted.

Now the escape functions. escape_contents looks at the passed token. If it is a double curly brace, it immediately returns a single brace as replacement. A single left brace is processed by parse_number, defined below. A single right brace is forbidden. No other tokens can be passed to escape_contents. escape_attributes has an additional argument, but we can ignore it for now. (This argument is the position in the attribute value, for advanced post-processing.)

  let escape_contents tok mng =
    match tok with
      | Lcurly (* "{" *) ->
          parse_number mng
      | LLcurly (* "{{" *) ->
          "{"
      | Rcurly (* "}" *) ->
          failwith "Single } not allowed"
      | RRcurly (* "}}" *) ->
          "}"
      | _ ->
          assert false

  let escape_attributes tok pos mng =
    escape_contents tok mng

Now, parse_number invokes our custom lexer Lex.scan_number with the (otherwise) internal PXP lexbuf. The function returns the replacement text.

It is part of the interface that, when the escape function returns, the lexbuf must be positioned at the character following the right curly brace.

  let parse_number mng =
    let lexbuf = 
       match mng # current_lexer_obj # lexbuf with
         | `Ocamllex lexbuf -> lexbuf
         | `Netulex _ -> failwith "Netulex lexbufs not supported" in
    match Lex.scan_number lexbuf with
      | `Int n ->
           let s = string_of_int n in
           ( match  Lex.scan_number lexbuf with
               | `Int _ ->
                    failwith "More than one number"
               | `End ->
                    ()
               | `Bad ->
                    failwith "Bad character"
               | `Eof ->
                    failwith "Unexpected EOF"
           );
           s
      | `End ->
           failwith "Empty curly braces"
      | `Bad ->
           failwith "Bad character"
      | `Eof ->
           failwith "Unexpected EOF"

Due to the way PXP works internally, the method mng # current_lexer_obj # lexbuf can return two different kinds of lexical buffers. `Ocamllex means it is a Lexing.lexbuf buffer. This type of buffer is used for all 8 bit encodings, and if the special pxp-lex-utf8 lexer is used. The lexer pxp-ulex-utf8, however, will return a Netulex-style buffer.

Finally, we enable our escaping functions in the config record:

let config =
     { Pxp_types.default_config with
         escape_contents = escape_contents;
         escape_attributes = escape_attributes
     }

How a complex example could work

The mentioned example is simple because the return value is a string. One can imagine, however, complex scenarios where one wants to insert custom events into the event stream. The PXP interface does not allow this directly. As a workaround we suggest the following.

The custom events are collected in special buffers. The buffers are numbered by sequential integers (0, 1, ...). So escape_contents would allocate such a buffer and get a number:

  let buffer, n = allocate_event_buffer()

Here, buffer could be an event Queue.t. The number n identifies the buffer. The buffers, once filled, can be looked up by

  let buffer = lookup_event_buffer n
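A minimal sketch of such a buffer registry (allocate_event_buffer and lookup_event_buffer are user code suggested here, not part of PXP):

  (* User-side buffer registry. Each buffer is a queue of events,
     identified by a sequential number. *)
  let buffers : (int, Pxp_types.event Queue.t) Hashtbl.t = Hashtbl.create 17
  let next_buffer = ref 0

  let allocate_event_buffer () =
    let n = !next_buffer in
    incr next_buffer;
    let buffer = Queue.create () in
    Hashtbl.add buffers n buffer;
    (buffer, n)

  let lookup_event_buffer n =
    Hashtbl.find buffers n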

escape_contents would like to return the events collected in the buffer, so that these are inserted into the event stream at the position where the curly escape occurs. As this is not allowed, it simply returns the buffer number instead, so that the position can be identified later, e.g.

  "{BUFFER " ^ string_of_int n ^ "}"

For unescaping curly braces one would insert special tokens, e.g. "{LCURLY}" and "{RCURLY}".

Now, the parser, specially configured with escape_contents, will return event streams where E_char_data events may include these special pointers to buffers {BUFFER <n>}, and the curly brace tokens {LCURLY} and {RCURLY}. In a postprocessing step, all occurrences of these tokens are located in the event stream, and

  • for buffer tokens the buffer contents are looked up (lookup_event_buffer), and the events found there are substituted
  • for {LCURLY} an E_char_data "{" event is substituted
  • for {RCURLY} an E_char_data "}" event is substituted
It can be assumed that the tokens to localize are still E_char_data events of their own, i.e. not merged with adjacent E_char_data events.
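A sketch of this postprocessing pass over an event list (assuming the buffer helpers above, and that the markers appear as separate E_char_data events):

  (* Substitution pass - user code, not a PXP function. *)
  let substitute_markers events =
    let expand ev =
      match ev with
        | Pxp_types.E_char_data "{LCURLY}" -> [ Pxp_types.E_char_data "{" ]
        | Pxp_types.E_char_data "{RCURLY}" -> [ Pxp_types.E_char_data "}" ]
        | Pxp_types.E_char_data s
            when String.length s > 9 && String.sub s 0 8 = "{BUFFER " ->
              let n = int_of_string (String.sub s 8 (String.length s - 9)) in
              let buffer = lookup_event_buffer n in
              List.rev (Queue.fold (fun acc e -> e :: acc) [] buffer)
        | other -> [ other ]
    in
    List.concat (List.map expand events)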

Admittedly, this is a complicated workaround.

For attributes one can do basically the same. The postprocessing step may be a lot more complicated, however.
