4.2. Resolvers and sources

4.2.1. Using the built-in resolvers (called sources)

The type source enumerates the two possible origins of the document to parse.

type source =
    Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver)
  | ExtID of (ext_id * Pxp_reader.resolver)

You normally need not worry about this type, as there are convenience functions that create source values:

  • from_file s: The document is read from file s; you may specify absolute or relative path names. The file name must be encoded as a UTF-8 string.

    There is an optional argument ~system_encoding specifying the character encoding which is used for the names of the file system. For example, if this encoding is ISO-8859-1 and s is also an ISO-8859-1 string, you can form the source:

    let s_utf8  =  recode_string ~in_enc:`Enc_iso88591 ~out_enc:`Enc_utf8 s in
    from_file ~system_encoding:`Enc_iso88591 s_utf8

    This source has the advantage that it is able to resolve inner external entities; i.e. if your document includes data from another file (using the SYSTEM attribute), this mode will find that file. However, this mode can resolve neither PUBLIC identifiers nor SYSTEM identifiers other than "file:" URLs.

  • from_channel ch: The document is read from the channel ch. In general, this source also supports file URLs found in the document; however, by default only absolute URLs are understood. It is possible to associate an ID with the channel such that the resolver knows how to interpret relative URLs:

    from_channel ~id:(System "file:///dir/dir1/") ch

    There is also the ~system_encoding argument specifying how file names are encoded. The example from above can also be written as follows. Note, however, that relative URLs can no longer be interpreted because there is no ~id argument, and computing this argument is relatively complicated because it must be a valid URL:

    let ch = open_in s in
    let src = from_channel ~system_encoding:`Enc_iso88591 ch in
    ...;
    close_in ch

  • from_string s: The string s is the document to parse. This mode is not able to interpret file names of SYSTEM clauses, nor can it look up PUBLIC identifiers.

    Normally, the encoding of the string is detected as usual by analyzing the XML declaration, if any. However, it is also possible to specify the encoding directly:

    let src = from_string ~fixenc:`Enc_iso88592 s
  • ExtID (id, r): The document to parse is denoted by the identifier id (either a SYSTEM or PUBLIC clause), and this identifier is interpreted by the resolver r. Use this mode if you have written your own resolver.

    Which character sets are possible depends on the passed resolver r.

  • Entity (get_entity, r): The document to parse is returned by the function invocation get_entity dtd, where dtd is the DTD object to use (it may be empty). Inner external references occurring in this entity are resolved using the resolver r.

    Which character sets are possible depends on the passed resolver r.
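
The from_channel item above notes that the ~id argument must be a valid URL and that computing it is relatively complicated. As a rough illustration only, the following sketch builds a "file:" URL for a directory from an absolute Unix-style path. The helper names percent_encode_path and file_url_of_dir are hypothetical and not part of PXP; the sketch assumes Unix paths and percent-encodes the bytes of the path as they are, without any character set conversion.

```ocaml
(* Hypothetical helpers, not part of PXP: turn an absolute Unix directory
   path into a "file:" URL that ends with a slash. *)

(* Percent-encode every byte that is not an unreserved character or '/'. *)
let percent_encode_path (path : string) : string =
  let buf = Buffer.create (String.length path) in
  String.iter
    (fun c ->
       match c with
       | 'A'..'Z' | 'a'..'z' | '0'..'9' | '-' | '_' | '.' | '~' | '/' ->
           Buffer.add_char buf c
       | _ ->
           Buffer.add_string buf (Printf.sprintf "%%%02X" (Char.code c)))
    path;
  Buffer.contents buf

(* Assumption: abs_dir is an absolute path such as "/dir/dir1". *)
let file_url_of_dir (abs_dir : string) : string =
  if abs_dir = "" || abs_dir.[0] <> '/' then
    invalid_arg "file_url_of_dir: absolute path required";
  let dir =
    if abs_dir.[String.length abs_dir - 1] = '/' then abs_dir
    else abs_dir ^ "/"
  in
  "file://" ^ percent_encode_path dir
```

The resulting string could then be wrapped as System (file_url_of_dir dir) and passed as the ~id argument of from_channel, in the same way as the literal "file:///dir/dir1/" in the example above.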

4.2.2. The resolver API

A resolver is an object that can be opened like a file. Instead of a file name, however, you pass the XML identifier of the entity to read from (either a SYSTEM or PUBLIC clause). When opened, the resolver must return the Lexing.lexbuf that reads the characters. The resolver can be closed, and it can be cloned. Furthermore, it is possible to tell the resolver which character set it should assume. The following definitions are taken from Pxp_reader:

exception Not_competent
exception Not_resolvable of exn

class type resolver =
  object
    method init_rep_encoding : rep_encoding -> unit
    method init_warner : collect_warnings -> unit
    method rep_encoding : rep_encoding
    method open_in : ext_id -> Lexing.lexbuf
    method close_in : unit
    method change_encoding : string -> unit
    method clone : resolver
    method close_all : unit
  end
The resolver object must work as follows:

  • When the parser is called, it tells the resolver the warner object and the internal encoding by invoking init_warner and init_rep_encoding. The resolver should store these values. The method rep_encoding should return the internal encoding.

  • When the parser wants to read from the resolver, it invokes the method open_in. Either the resolver succeeds, in which case the Lexing.lexbuf reading from the file or stream must be returned, or opening fails. In the latter case the method implementation should raise an exception (see below).

  • When the parser finishes reading, it calls the close_in method.

  • When the parser finds a reference to another external entity in the input stream, it calls clone to get a second resolver, which must be initially closed (not yet connected with an input stream). The parser then invokes open_in and the other methods as described.

  • If you already know the character set of the input stream, you should recode it to the internal encoding, and define the method change_encoding as an empty method.

  • If you want to support multiple external character sets, the object must follow a much more complicated protocol. Directly after open_in has been called, the resolver must return a lexical buffer that only reads one byte at a time. This is only possible if you create the lexical buffer with Lexing.from_function; the function must then always return 1 if EOF is not yet reached, and 0 if EOF is reached. Once the parser has read the first line of the document, it will invoke change_encoding to tell the resolver which character set to assume. From this moment on, the object can return more than one byte at once. The argument of change_encoding is either the parameter of the "encoding" attribute of the XML declaration, or the empty string if there is no XML declaration or if the declaration does not contain an encoding attribute.

    At the beginning the resolver must only return one character every time something is read from the lexical buffer. The reason for this is that otherwise you would not know exactly at which position in the input stream the character set changes.

    If you want automatic recognition of the character set, it is up to the resolver object to implement this.

  • If an error occurs, the parser calls the method close_all for the top-level resolver; this method should close the resolver itself (if not already done) and all its clones.
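
The byte-wise reading rule in the protocol above can be sketched with the standard library alone. The class byte_wise_reader below is a hypothetical model, not the Pxp_reader implementation: it only shows how a refill function for Lexing.from_function can hand out one byte per call until change_encoding has been invoked, and larger chunks afterwards. It does not recode anything and implements none of the other resolver methods.

```ocaml
(* Hypothetical model of the "one byte at a time" rule; it reads from an
   in-memory string and performs no recoding. *)
class byte_wise_reader (data : string) =
  object (self)
    val mutable pos = 0
    val mutable encoding_known = false
    val mutable declared = ""

    (* Called once the first line has been read; the argument is the
       declared encoding, or "" if none was declared. *)
    method change_encoding (enc : string) : unit =
      declared <- enc;
      encoding_known <- true

    method declared_encoding : string = declared

    (* Refill function: at most one byte before change_encoding,
       up to n bytes afterwards, and 0 at EOF. *)
    method read (buf : Bytes.t) (n : int) : int =
      let remaining = String.length data - pos in
      let limit = if encoding_known then n else min 1 n in
      let k = min limit remaining in
      Bytes.blit_string data pos buf 0 k;
      pos <- pos + k;
      k

    (* The lexical buffer that would be returned from open_in. *)
    method lexbuf : Lexing.lexbuf =
      Lexing.from_function (fun buf n -> self#read buf n)
  end
```

Because the switch happens exactly at the call to change_encoding, no byte beyond the current read position has been consumed under the wrong assumption, which is the reason given above for the one-byte restriction.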

Exceptions: It is possible to chain resolvers such that when the first resolver is not able to open the entity, the other resolvers of the chain are tried in turn. The method open_in should raise the exception Not_competent to indicate that the next resolver should try to open the entity. If the resolver is able to handle the ID, but some other error occurs, the exception Not_resolvable should be raised to break the chain.
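
The difference between the two exceptions can be illustrated with a small stand-alone model. Everything in the following sketch is an assumption made for illustration: the exceptions and the ext_id type are local stand-ins for the declarations in Pxp_reader, a resolver is reduced to a plain function from an ID to the document text, and the names opener, open_with_chain, only_public and system_files are hypothetical.

```ocaml
(* Local stand-ins mirroring the declarations quoted from Pxp_reader. *)
exception Not_competent
exception Not_resolvable of exn

(* Simplified identifier type for this sketch only. *)
type ext_id =
  | System of string
  | Public of string * string

(* A resolver is modelled as a function from an ID to the document text. *)
type opener = ext_id -> string

(* Try each opener in turn. Not_competent moves on to the next opener;
   any other exception, including Not_resolvable, propagates to the
   caller and thereby breaks the chain. *)
let rec open_with_chain (chain : opener list) (id : ext_id) : string =
  match chain with
  | [] -> raise Not_competent
  | first :: rest ->
      (try first id with Not_competent -> open_with_chain rest id)

(* Accepts PUBLIC identifiers only. *)
let only_public : opener = function
  | Public _ -> "<public/>"
  | System _ -> raise Not_competent

(* Accepts SYSTEM identifiers, but one of them cannot be opened. *)
let system_files : opener = function
  | System "file:///missing.xml" -> raise (Not_resolvable Not_found)
  | System _ -> "<system/>"
  | Public _ -> raise Not_competent
```

In this model, opening System "file:///missing.xml" with the chain [system_files; fallback] never reaches fallback, because system_files has declared itself responsible for the ID and reports the failure with Not_resolvable.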

Example: How to define a resolver that is equivalent to from_string: ...

4.2.3. Predefined resolver components

Note that the following section is not yet up to date. There are currently many more resolvers, especially resolvers for catalogs of PUBLIC and/or SYSTEM identifiers. See the interface Pxp_reader for details.

There are some classes in Pxp_reader that define common resolver behaviour.

class resolve_read_this_channel : 
    ?id:ext_id -> 
    ?fixenc:encoding -> 
    ?auto_close:bool -> 
    in_channel -> 
        resolver

Reads from the passed channel (it may even be a pipe). If the ~id argument is passed to the object, the created resolver accepts only this ID. Otherwise all IDs are accepted.

Once the resolver has been cloned, it does not accept any ID. This means that this resolver cannot handle inner references to external entities. Note that you can combine this resolver with another resolver that can handle inner references (such as resolve_as_file); see class 'combine' below.

If you pass the ~fixenc argument, the encoding of the channel is set to the passed value, regardless of any auto-recognition or any XML declaration.

If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

class resolve_read_any_channel : 
    ?auto_close:bool -> 
    channel_of_id:(ext_id -> (in_channel * encoding option)) -> 
        resolver
This resolver calls the function ~channel_of_id to open a new channel for the passed ext_id. This function must either return the channel and the encoding, or it must fail with Not_competent. The function must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the channel is e. If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

class resolve_read_url_channel :
    ?base_url:Neturl.url ->
    ?auto_close:bool -> 
    url_of_id:(ext_id -> Neturl.url) -> 
    channel_of_url:(Neturl.url -> (in_channel * encoding option)) -> 
        resolver
When this resolver gets an ID to read from, it calls the function ~url_of_id to get the corresponding URL. This URL may be a relative URL; however, a URL scheme must be used which contains a path. The resolver converts the URL to an absolute URL if necessary. The second function, ~channel_of_url, is fed with the absolute URL as input. This function opens the resource to read from, and returns the channel and the encoding of the resource.

Both functions, ~url_of_id and ~channel_of_url, can raise Not_competent to indicate that the object is not able to read from the specified resource. However, there is a difference: A Not_competent from ~url_of_id is left as it is, but a Not_competent from ~channel_of_url is converted to Not_resolvable. So only ~url_of_id decides which URLs are accepted by the resolver and which are not.

The function ~channel_of_url must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the channel is e.

If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

Objects of this class contain a base URL relative to which relative URLs are interpreted. When creating a new object, you can specify the base URL by passing it as ~base_url argument. When an existing object is cloned, the base URL of the clone is the URL of the original object. Note that the term "base URL" has a strict definition in RFC 1808.

class resolve_read_this_string : 
    ?id:ext_id -> 
    ?fixenc:encoding -> 
    string -> 
        resolver

Reads from the passed string. If the ~id argument is passed to the object, the created resolver accepts only this ID. Otherwise all IDs are accepted.

Once the resolver has been cloned, it does not accept any ID. This means that this resolver cannot handle inner references to external entities. Note that you can combine this resolver with another resolver that can handle inner references (such as resolve_as_file); see class 'combine' below.

If you pass the ~fixenc argument, the encoding of the string is set to the passed value, regardless of any auto-recognition or any XML declaration.

class resolve_read_any_string : 
    string_of_id:(ext_id -> (string * encoding option)) -> 
        resolver
This resolver calls the function ~string_of_id to get the string for the passed ext_id. This function must either return the string and the encoding, or it must fail with Not_competent. The function must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the string is e.
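
As an illustration of the contract described for ~string_of_id, the following stand-alone sketch looks up SYSTEM identifiers in an in-memory table. The exception and the types ext_id and encoding are simplified local stand-ins, and the "mem:" identifiers, the table catalog and its contents are invented for this example; only the shape of the function (return the string and an encoding option, or fail with Not_competent) follows the description above.

```ocaml
(* Local stand-ins so that the sketch compiles without PXP. *)
exception Not_competent

type ext_id =
  | System of string
  | Public of string * string

(* A small subset of encodings, for illustration only. *)
type encoding = [ `Enc_utf8 | `Enc_iso88591 ]

(* Hypothetical catalog: SYSTEM identifier -> (document text, encoding).
   None means that the encoding should be recognized by the default
   mechanism; Some e means that the encoding is already known to be e. *)
let catalog : (string * (string * encoding option)) list =
  [ "mem:header.xml", ("<header/>", Some `Enc_utf8);
    "mem:legacy.xml",
    ("<?xml version='1.0' encoding='ISO-8859-1'?><legacy/>", None) ]

(* Returns the string and its encoding, or fails with Not_competent
   for identifiers that this function does not know about. *)
let string_of_id (id : ext_id) : string * encoding option =
  match id with
  | System name ->
      (match List.assoc_opt name catalog with
       | Some entry -> entry
       | None -> raise Not_competent)
  | Public _ -> raise Not_competent
```

A function of this shape, written against the real PXP types, would be passed as the ~string_of_id argument of resolve_read_any_string.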

class resolve_as_file :
    ?file_prefix:[ `Not_recognized | `Allowed | `Required ] ->
    ?host_prefix:[ `Not_recognized | `Allowed | `Required ] ->
    ?system_encoding:encoding ->
    ?url_of_id:(ext_id -> Neturl.url) -> 
    ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) ->
    unit -> 
        resolver

Reads from the local file system. Every file name is interpreted as a file name of the local file system, and the file it refers to is read.

The full form of a file URL is: file://host/path, where 'host' specifies the host system where the file identified by 'path' resides. host = "" or host = "localhost" are accepted; other values will raise Not_competent. The standard for file URLs is defined in RFC 1738.
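
The host rule can be sketched as a small stand-alone function. This is a simplified model and not the code of resolve_as_file: the function name local_path_of_file_url is hypothetical, Not_competent is a local stand-in, only the full form file://host/path is handled, the host comparison is case-sensitive, and no percent-decoding of the path is attempted.

```ocaml
(* Local stand-in for Pxp_reader.Not_competent. *)
exception Not_competent

(* Return the path of a "file://host/path" URL if the host is "" or
   "localhost"; raise Not_competent for anything else. *)
let local_path_of_file_url (url : string) : string =
  let prefix = "file://" in
  let plen = String.length prefix in
  if String.length url < plen || String.sub url 0 plen <> prefix then
    raise Not_competent;
  let rest = String.sub url plen (String.length url - plen) in
  (* The host ends at the first slash, which also starts the path. *)
  match String.index_opt rest '/' with
  | None -> raise Not_competent
  | Some i ->
      let host = String.sub rest 0 i in
      let path = String.sub rest i (String.length rest - i) in
      if host = "" || host = "localhost" then path
      else raise Not_competent
```

With this model, "file:///etc/doc.xml" and "file://localhost/etc/doc.xml" both yield "/etc/doc.xml", while a URL naming any other host is rejected.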

Option ~file_prefix: Specifies how the "file:" prefix of file names is handled:

  • `Not_recognized: The prefix is not recognized.

  • `Allowed: The prefix is allowed but not required (the default).

  • `Required: The prefix is required.

Option ~host_prefix: Specifies how the "//host" phrase of file names is handled:

  • `Not_recognized: The prefix is not recognized.

  • `Allowed: The prefix is allowed but not required (the default).

  • `Required: The prefix is required.

Option ~system_encoding: Specifies the encoding of file names of the local file system. Default: UTF-8.

Options ~url_of_id, ~channel_of_url: Not for the casual user!

class combine : 
    ?prefer:resolver -> 
    resolver list -> 
        resolver
Combines several resolver objects. If a concrete entity with an ext_id is to be opened, the combined resolver tries the contained resolvers in turn until a resolver accepts opening the entity (i.e. it does not raise Not_competent on open_in).

Clones: If the 'clone' method is invoked before 'open_in', all contained resolvers are cloned separately and again combined. If the 'clone' method is invoked after 'open_in' (i.e. while the resolver is open), additionally the clone of the active resolver is flagged as being preferred, i.e. it is tried first.

This web site is published by Informatikbüro Gerd Stolpmann