Module Pxp_types

module Pxp_types: sig .. end

Type definitions used throughout PXP

This module re-exports all the types listed in Pxp_core_types.S (and finally defined in Pxp_core_types.I), so the user only has to open Pxp_types to get all relevant type definitions. The re-exported definitions are shown below:

include Pxp_core_types.S

module StringMap: Map.S  with type key = string

For maps with string keys

Identifiers

type ext_id = Pxp_core_types.A.ext_id =
  | System of string
  | Public of (string * string)
  | Anonymous
  | Private of private_id

External identifiers are names for documents. A System identifier is a URL. PXP (without extensions) only supports file URLs of the form file:///directory/directory/.../file. Percent encoding (% plus two hex digits) is supported in file URLs. A public identifier can be looked up in a catalog to find a local copy of the file; this type is mostly used for well-known documents (e.g. after standardization). A public identifier can be accompanied by a system identifier (Public(pubid,sysid)), but the system identifier can be the empty string. The value Anonymous should not be used to identify a real document; it is intended as a placeholder for an ID that is not yet known. Private identifiers are used by PXP internally. Unlike system or public IDs, they have no textual counterpart.

The identifiers are encoded as UTF-8 strings.

type private_id = Pxp_core_types.A.private_id

A private ID is an opaque identifier

val allocate_private_id : unit -> private_id

Get a new unique private ID

type resolver_id = Pxp_core_types.A.resolver_id = {
  rid_private : private_id option;
  rid_public : string option;
  rid_system : string option;
  rid_system_base : string option;
}

A resolver ID is a version of external identifiers used during resolving (i.e. the process of mapping the identifier to a real resource). The same entity can have several names during resolving: one private ID, one public ID, and one system ID. For resolving system IDs, the base URL is also remembered (usually the system ID of the opener of the entity).

val resolver_id_of_ext_id : ext_id -> resolver_id

The standard way of converting an ext_id into a resolver ID. A System ID is turned into a resolver_id where only rid_system is set. A Public ID is turned into a resolver_id where both rid_public and rid_system are set. A Private ID is turned into a resolver_id where only rid_private is set. An Anonymous ID is turned into a resolver_id without any value (all components are None).
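The conversion rules above can be sketched as follows. This is an illustrative model only, not the library's code: the types are redeclared locally so that the snippet compiles without PXP, `private_id` is modelled as an `int` (in PXP it is opaque), and the names `empty_rid` and `resolver_id_of_ext_id_sketch` are hypothetical.

```ocaml
(* Local stand-ins for the PXP types (assumption: private_id is an int
   here purely for illustration; in PXP it is an opaque type). *)
type private_id = int

type ext_id =
  | System of string
  | Public of (string * string)
  | Anonymous
  | Private of private_id

type resolver_id = {
  rid_private : private_id option;
  rid_public : string option;
  rid_system : string option;
  rid_system_base : string option;
}

(* A resolver ID with no component set (hypothetical helper). *)
let empty_rid =
  { rid_private = None; rid_public = None;
    rid_system = None; rid_system_base = None }

(* Follows the rules stated in the text: one case per kind of ext_id. *)
let resolver_id_of_ext_id_sketch = function
  | System sysid -> { empty_rid with rid_system = Some sysid }
  | Public (pubid, sysid) ->
      { empty_rid with rid_public = Some pubid; rid_system = Some sysid }
  | Private p -> { empty_rid with rid_private = Some p }
  | Anonymous -> empty_rid
```

In real code you would simply call `resolver_id_of_ext_id`; the sketch only makes the four cases explicit.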

type dtd_id = Pxp_core_types.A.dtd_id =
  | External of ext_id  (* DTD is completely external *)
  | Derived of ext_id   (* DTD is derived from an external DTD *)
  | Internal            (* DTD is completely internal *)

Identifier for DTDs

Content models (in DTDs)

type content_model_type = Pxp_core_types.A.content_model_type =
  | Unspecified  (* A specification of the model has not yet been found *)
  | Empty        (* Nothing is allowed as content *)
  | Any          (* Everything is allowed as content *)
  | Mixed of mixed_spec list
      (* The contents consist of elements and PCDATA in arbitrary order. What is allowed in particular is given as mixed_spec. *)
  | Regexp of regexp_spec
      (* The contents are elements following this regular expression *)

Element declaration in a DTD

type mixed_spec = Pxp_core_types.A.mixed_spec =
  | MPCDATA           (* PCDATA children are allowed *)
  | MChild of string  (* This kind of Element is allowed *)

Children of an element in "mixed"-style declaration

type regexp_spec = Pxp_core_types.A.regexp_spec =
  | Optional of regexp_spec   (* subexpression? *)
  | Repeated of regexp_spec   (* subexpression* *)
  | Repeated1 of regexp_spec  (* subexpression+ *)
  | Alt of regexp_spec list   (* subexpr1 | subexpr2 | ... | subexprN *)
  | Seq of regexp_spec list   (* subexpr1 , subexpr2 , ... , subexprN *)
  | Child of string           (* This kind of Element is allowed here *)

Children of an element in a regexp-style declaration
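To make the correspondence between `regexp_spec` values and DTD syntax concrete, here is a small printer that turns a content model back into its textual form. It is a sketch under assumptions: the type is redeclared locally so the snippet compiles without PXP (with PXP you would open Pxp_types instead), and `string_of_regexp` is a hypothetical helper, not a PXP function.

```ocaml
(* Local copy of the variant shown above, to keep the sketch standalone. *)
type regexp_spec =
  | Optional of regexp_spec
  | Repeated of regexp_spec
  | Repeated1 of regexp_spec
  | Alt of regexp_spec list
  | Seq of regexp_spec list
  | Child of string

(* Print a content model using the DTD operators ?, *, +, | and ,
   with every Alt and Seq group wrapped in parentheses. *)
let rec string_of_regexp = function
  | Optional r -> string_of_regexp r ^ "?"
  | Repeated r -> string_of_regexp r ^ "*"
  | Repeated1 r -> string_of_regexp r ^ "+"
  | Alt rs -> "(" ^ String.concat "|" (List.map string_of_regexp rs) ^ ")"
  | Seq rs -> "(" ^ String.concat "," (List.map string_of_regexp rs) ^ ")"
  | Child name -> name

(* The model one would expect for <!ELEMENT a (b,c*,(d|e)?)> *)
let example =
  Seq [ Child "b"; Repeated (Child "c");
        Optional (Alt [ Child "d"; Child "e" ]) ]
```

Applying `string_of_regexp` to `example` gives the string `(b,c*,(d|e)?)`.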

type att_type = Pxp_core_types.A.att_type =
  | A_cdata     (* CDATA *)
  | A_id        (* ID *)
  | A_idref     (* IDREF *)
  | A_idrefs    (* IDREFS *)
  | A_entity    (* ENTITY *)
  | A_entities  (* ENTITIES *)
  | A_nmtoken   (* NMTOKEN *)
  | A_nmtokens  (* NMTOKENS *)
  | A_notation of string list  (* NOTATION (name1 | name2 | ... | nameN) *)
  | A_enum of string list      (* (name1 | name2 | ... | nameN) *)

Attribute declaration in a DTD

type att_default = Pxp_core_types.A.att_default =
  | D_required           (* #REQUIRED *)
  | D_implied            (* #IMPLIED *)
  | D_default of string  (* a value default -- the value is already expanded *)
  | D_fixed of string    (* #FIXED value default -- the value is already expanded *)

Default value of an attribute

Attribute value

type att_value = Pxp_core_types.A.att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

Enumerates the possible values of an attribute:

Value s: The attribute is declared as a non-list type, or the attribute is undeclared; and the attribute is either defined with value "s", or it is missing but has the default value "s".
Valuelist [s1;...;sk]: The attribute is declared as a list type, and the attribute is either defined with value "s1 ... sk" (space-separated words), or it is missing but has the default value "s1 ... sk".
Implied_value: The attribute is declared without default value, and there is no definition for the attribute.
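Code that reads attributes usually has to handle all three cases. The following sketch flattens an `att_value` into an optional string; the type is redeclared locally so the snippet stands alone, and `att_value_to_string` is a hypothetical helper name, not part of PXP.

```ocaml
(* Local copy of the variant shown above. *)
type att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

(* Render an attribute value as text; list values are joined with single
   spaces, matching the "s1 ... sk" form described above. An implied
   value has no text at all, hence the option type. *)
let att_value_to_string = function
  | Value s -> Some s
  | Valuelist words -> Some (String.concat " " words)
  | Implied_value -> None
```

Returning `None` for `Implied_value` keeps "attribute absent" distinguishable from "attribute present but empty" (`Some ""`).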

Warnings

class type collect_warnings = object .. end

This object is sometimes used for outputting user warnings

class drop_warnings : collect_warnings

Drop any warnings

type warning = [ `W_XML_version_not_supported of string
       | `W_code_point_cannot_be_represented of int
       | `W_element_mentioned_but_not_declared of string
       | `W_entity_declared_twice of string
       | `W_multiple_ATTLIST_declarations of string
       | `W_multiple_attribute_declarations of string * string
       | `W_name_is_reserved_for_extensions of string ]

Kinds of warnings

class type symbolic_warnings = object .. end

This object is sometimes used for outputting user warnings

val string_of_warning : warning -> string

Turn the warning into a human-readable message

val warn : symbolic_warnings option ->
       collect_warnings -> warning -> unit

Send a warning to the symbolic_warnings object, and then to the collect_warnings object.

Encoding

type encoding = Netconversion.encoding

For the representation of external resources (files etc.) we accept all encodings for character sets which are defined in Netconversion (package netstring).

type rep_encoding = [ `Enc_cp1006
       | `Enc_cp437
       | `Enc_cp737
       | `Enc_cp775
       | `Enc_cp850
       | `Enc_cp852
       | `Enc_cp855
       | `Enc_cp856
       | `Enc_cp857
       | `Enc_cp860
       | `Enc_cp861
       | `Enc_cp862
       | `Enc_cp863
       | `Enc_cp864
       | `Enc_cp865
       | `Enc_cp866
       | `Enc_cp869
       | `Enc_cp874
       | `Enc_iso88591
       | `Enc_iso885910
       | `Enc_iso885913
       | `Enc_iso885914
       | `Enc_iso885915
       | `Enc_iso885916
       | `Enc_iso88592
       | `Enc_iso88593
       | `Enc_iso88594
       | `Enc_iso88595
       | `Enc_iso88596
       | `Enc_iso88597
       | `Enc_iso88598
       | `Enc_iso88599
       | `Enc_koi8r
       | `Enc_macroman
       | `Enc_usascii
       | `Enc_utf8
       | `Enc_windows1250
       | `Enc_windows1251
       | `Enc_windows1252
       | `Enc_windows1253
       | `Enc_windows1254
       | `Enc_windows1255
       | `Enc_windows1256
       | `Enc_windows1257
       | `Enc_windows1258 ]

The subset of encoding that may be used for the internal representation of strings. The common property of these encodings is that they are ASCII-compatible; the PXP code relies on that.

Exceptions

exception Validation_error of string

Violation of a validity constraint

exception WF_error of string

Violation of a well-formedness constraint

exception Namespace_error of string

Violation of a namespace constraint

exception Error of string

Other error

exception Character_not_supported

exception At of (string * exn)

The string is a description of where the exception occurred. The exn value can itself be At(_,_) (for example, when an entity within an entity causes the error).
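Because At can be nested, code that inspects errors by hand typically peels off the wrappers first. Below is a minimal sketch of that idea; the two exceptions are redeclared locally so the snippet compiles without PXP, and `unwrap_at` is a hypothetical helper. For producing messages, the library's own string_of_exn is the intended entry point.

```ocaml
(* Local stand-ins for the PXP exceptions of the same names. *)
exception At of (string * exn)
exception WF_error of string

(* Collect the location descriptions from the outermost to the innermost
   At wrapper, and return them together with the underlying exception. *)
let rec unwrap_at = function
  | At (where, inner) ->
      let locations, root = unwrap_at inner in
      (where :: locations, root)
  | e -> ([], e)
```

For example, `unwrap_at (At ("outer", At ("inner", WF_error "bad")))` yields the locations `["outer"; "inner"]` and the root exception `WF_error "bad"`.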

exception Undeclared

Indicates that no declaration is available and because of this every kind of usage is allowed. (Raised by some DTD methods.)

exception Method_not_applicable of string

Indicates that a method has been called that is not applicable for the class. The argument is the name of the method.

exception Namespace_method_not_applicable of string

Indicates that the called method is a namespace method but that the object does not support namespaces. The argument is the name of the method.

exception Not_competent

The resolver cannot open this kind of entity ID

exception Not_resolvable of exn

While opening the entity, the nested exception occurred

exception Namespace_not_managed of string

A namespace URI is used but not declared in the namespace manager. The string argument is the URI in question.

exception Namespace_prefix_not_managed of string

A namespace prefix is used but not declared in the namespace manager. The string argument is the prefix in question.

exception Namespace_not_in_scope of string

The namespace scope does not know the URI

val string_of_exn : exn -> string

Converts a PXP exception into a readable string

Output destination

type output_stream = [ `Out_buffer of Buffer.t
       | `Out_channel of Pervasives.out_channel
       | `Out_function of string -> int -> int -> unit
       | `Out_netchannel of Netchannels.out_obj_channel ]

Designates an output destination for several printers:

`Out_buffer b: Output to buffer b
`Out_channel ch: Output to channel ch
`Out_function f: Output to function f. The function f is used like Pervasives.output_string.
`Out_netchannel n: Output to the ocamlnet channel n

val write : output_stream -> string -> int -> int -> unit

write os s pos len: Writes the string (portion) to the buffer/channel/stream
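The dispatch that write performs can be pictured as below. This is a sketch, not the library's implementation: it uses a local type that omits the `Out_netchannel case (which would require Ocamlnet), and the names `output_stream_sketch` and `write_sketch` are hypothetical.

```ocaml
(* Local, reduced version of output_stream without the netchannel case. *)
type output_stream_sketch =
  [ `Out_buffer of Buffer.t
  | `Out_channel of out_channel
  | `Out_function of (string -> int -> int -> unit) ]

(* Write the substring of s starting at pos with length len to the
   chosen destination. *)
let write_sketch (os : output_stream_sketch) s pos len =
  match os with
  | `Out_buffer b -> Buffer.add_substring b s pos len
  | `Out_channel ch -> output_substring ch s pos len
  | `Out_function f -> f s pos len
```

A caller can therefore collect output in memory with `Out_buffer, or intercept it with `Out_function, without the printer having to know which is in use.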

Pools

type pool = Pxp_core_types.A.pool

A pool designates a way to increase string sharing

val make_probabilistic_pool : ?fraction:float -> int -> pool

A probabilistic string pool tries to map strings to pool strings in order to make it more likely that equal strings are stored in the same memory block. The int argument is the size of the pool; this is the number of entries of the pool. However, not all entries of the pool are used; the fraction argument (default: 0.3) determines the fraction of the actually used entries. The higher the fraction is, the more strings can be managed at the same time; the lower the fraction is, the more likely it is that a new string can be added to the pool.

val pool_string : pool -> string -> string

Tries to find the passed string in the pool; if the string is in the pool, the pool string is returned. Otherwise, the function tries to add the passed string to the pool, and the passed string is returned.

Configuration

type config = {

`warner : collect_warnings;`	`(*`	An object that collects warnings.	`*)`
`swarner : symbolic_warnings option;`	`(*`	Another object getting warnings expressed as polymorphic variants. This is especially useful to turn warnings into errors. If defined, the `swarner` gets the warning first before it is sent to the classic `warner`.	`*)`
`enable_pinstr_nodes : bool;`	`(*`	If `true`, processing instructions (PIs) are represented by nodes of their own in the document tree. If not enabled, PIs are attached to their surrounding elements, and the exact location within the element is lost. For example, if the XML text is `<s><?x?>foo<?y?></s>`, the parser normally produces only an element object for `s`, and attaches the PIs `x` and `y` to it (without order), and the details of `x` and `y` can only be found out with the `pinstr` method of the surrounding element. The only subelement is the data node for "foo". If `enable_pinstr_nodes` is true, the node for element `s` will contain two additional subnodes of type `T_pinstr`, one as left sibling of "foo", and one as right sibling. Any code processing such a tree must be prepared for processing instructions that occur as normal tree members and are no longer attached to the surrounding nodes. The event-based parser reacts to the `enable_pinstr_nodes` mode by emitting `E_pinstr` events exactly at the locations where the PIs occur in the text.	`*)`
`enable_comment_nodes : bool;`	`(*`	When enabled, comments are represented as nodes with type `T_comment`. If not enabled, comments are ignored. Event-based parser: This flag controls whether E_comment events are generated.	`*)`
`enable_super_root_node : bool;`	`(*`	The `enable_super_root_node` changes the layout of the document tree: The top-most node is no longer the top-most element of the document (i.e. the element root), but a special node called the super root node (`T_super_root`). The top-most element is then a child of the super root node. The super root node can have further children, namely comment nodes and processing instructions that are placed before or after the top-most element in the XML text. However, the exact behaviour depends on whether the other special modes in the configuration are also enabled: If `enable_pinstr_nodes` is also true, processing instruction nodes (`T_pinstr`) can occur as children of the super root node when processing instructions occur before or after the root element. If `enable_pinstr_nodes` is false, these instructions are simply attached to the super root node as they would be attached to ordinary elements within the tree. Note that processing instructions in the DTD part of the XML text are not meant here (i.e. instructions between the square brackets, or in an external DTD). These instructions are always attached to the DTD object (see `Pxp_dtd.dtd`). If `enable_comment_nodes` is also true, comment nodes can occur as children of the super root node when comments occur before or after the root element. If `enable_comment_nodes` is false, comments are ignored.	`*)`
`drop_ignorable_whitespace : bool;`	`(*`	Ignorable whitespace is whitespace between XML nodes where the DTD does not specify that `#PCDATA` must be parsed. For example, if the DTD contains `<!ELEMENT a (b,c)> <!ELEMENT b (#PCDATA)*> <!ELEMENT c EMPTY>` the XML text `<a><b> </b> <c></c></a>` is legal. There are two spaces: (1) Between `<b>` and `</b>`. Because `b` is declared with `#PCDATA`, this space character is not ignorable, and the parser will create a data node containing the character. (2) Between `</b>` and `<c>`. Because the declaration of `a` does not contain the keyword `#PCDATA`, character data is not expected at this position. However, XML allows whitespace to be written here in order to improve the readability of the XML text. Such whitespace material is considered "ignorable whitespace". If `drop_ignorable_whitespace` is true, the parser will not create a data node containing the character. Otherwise, the parser does create such a data node. Note that `c` is declared as `EMPTY`. XML does not allow space characters between `<c>` and `</c>`, so the question of whether such characters are to be ignored does not arise: they are simply illegal and will lead to a parsing error. In the well-formed mode, the parser treats every whitespace character occurring in an element as non-ignorable. Event-based parser: ignored. (Maybe there will be a stream filter with the same effect if I find time to program it.)	`*)`
`encoding : rep_encoding;`	`(*`	Specifies the encoding used for the internal representation of any character data.	`*)`
`recognize_standalone_declaration : bool;`	`(*`	Whether the `standalone` declaration is recognized or not. This option does not have an effect on well-formedness parsing: in this case such declarations are never recognized. Recognizing the `standalone` declaration means that the value of the declaration is scanned and passed to the DTD, and that the standalone-check is performed. This means: If a document is flagged `standalone='yes'` some additional constraints apply. The idea is that a parser without access to any external document subsets can still parse the document, and will still return the same values as the parser with such access. For example, if the DTD is external and if there are attributes with default values, it is checked that there is no element instance where these attributes are omitted - the parser would return the default value but this requires access to the external DTD subset. Event-based parser: The option has an effect if the `Parse_xml_decl entry flag is set. In this case, it is passed to the DTD whether there is a standalone declaration, ... and the rest is unclear.	`*)`
`store_element_positions : bool;`	`(*`	Whether the file name, the line and the column of the beginning of elements are stored in the element nodes. This option may be useful to generate error messages. Positions are only stored for: Elements Processing instructions if `T_pinstr` nodes are created for them (see `enable_pinstr_nodes`) For all other node types, no position is stored. You can access positions by the method `position` of nodes. Event-based parser: If true, the `E_position` events will be generated.	`*)`
`idref_pass : bool;`	`(*`	Whether the parser does a second pass and checks that all `IDREF` and `IDREFS` attributes contain valid references. This option works only if an ID index is available. To create an ID index, pass an index object as `id_index` argument to the parsing functions (such as `Pxp_tree_parser.parse_document_entity`). "Second pass" does not mean that the XML text is again parsed; only the existing document tree is traversed, and the check on bad `IDREF`/`IDREFS` attributes is performed for every node. Event-based parser: this option is ignored.	`*)`
`validate_by_dfa : bool;`	`(*`	If true, and if DFAs are available for validation, the DFAs will actually be used for validation. If false, or if no DFAs are available, the standard backtracking algorithm will be used. DFAs are only available if `accept_only_deterministic_models` is true (because in this case, it is relatively cheap to construct the DFAs). DFAs are a data structure which ensures that validation can always be performed in linear time. I strongly recommend using DFAs; however, there are examples for which validation by backtracking is faster. Event-based parser: this option is ignored.	`*)`
`accept_only_deterministic_models : bool;`	`(*`	Whether only deterministic content models are accepted in DTDs. Event-based parser: this option is ignored.	`*)`
`disable_content_validation : bool;`	`(*`	When set to true, content validation is disabled; however, other validation checks remain activated. This option is intended to save time when a validated document is parsed and it can be assumed that it is valid. Do not forget to set `accept_only_deterministic_models` to false to save maximum time (or DFAs will be computed which is rather expensive). Event-based parser: this option is ignored.	`*)`
`name_pool : Pxp_core_types.I.pool;`
`enable_name_pool_for_element_types : bool;`
`enable_name_pool_for_attribute_names : bool;`
`enable_name_pool_for_attribute_values : bool;`
`enable_name_pool_for_pinstr_targets : bool;`	`(*`	The name pool maps strings to pool strings such that strings with the same value share the same block of memory. Enabling the name pool saves memory, but makes the parser slower. Event-based parser: As far as I remember, some of the pool options are honoured, but not all.	`*)`
`enable_namespace_processing : Pxp_dtd.namespace_manager option;`	`(*`	Setting this option to a `namespace_manager` enables namespace processing. This works only if the namespace-aware implementation `namespace_element_impl` of element nodes is used in the spec; otherwise you will get error messages complaining about missing methods. Note that PXP uses a technique called "prefix normalization" to implement namespaces on top of the plain document model. This means that the namespace prefixes of elements and attributes are changed to unique prefixes if they are ambiguous, and that these "normprefixes" are actually stored in the document tree. Furthermore, the normprefixes are used for validation. (See `Intro_namespaces` for details.) Event-based parser: If true, the events `E_ns_start_tag` and `E_ns_end_tag` are generated instead of `E_start_tag`, and `E_end_tag`, respectively.	`*)`
`escape_contents : (Pxp_lexer_types.token -> Pxp_entity_manager.entity_manager -> string) option;`	`(*`	Experimental feature. If defined, the `escape_contents` function is called whenever the tokens "{", "{{", "}", or "}}" are found in the context of character data contents. The first argument is the token. The second argument is the entity manager, it can be used to access the lexing buffer directly. The result of the function are the characters to substitute. "{" is the token `Lcurly`, "{{" is the token `LLcurly`, "}" is the token `Rcurly`, and "}}" is the token `RRcurly`. Event-based parser: this option works.	`*)`
`escape_attributes : (Pxp_lexer_types.token -> int -> Pxp_entity_manager.entity_manager -> string) option;`	`(*`	Experimental feature. If defined, the `escape_attributes` function is called whenever the tokens "{", "{{", "}", or "}}" are found inside attribute values. The function takes three arguments: The token (`Lcurly`, `LLcurly`, `Rcurly` or `RRcurly`), the position in the attribute value, and the entity manager. The result of the function is the string substituted for the token. Example: The attribute is "a{b{{c", and the function is called as follows: `escape_attributes Lcurly 1 mng` - result is "42" (or an arbitrary string, but in this example it is "42") `escape_attributes LLcurly 4 mng` - result is "foo" The resulting attribute value is then "a42bfooc". See also `escape_contents`. Event-based parser: this option works.	`*)`
`debugging_mode : bool;`

}

val default_config : config

Default configuration. This is a recommended set of options that works generally:

Warnings are thrown away
Error messages will contain line numbers
Neither T_super_root nor T_pinstr nor T_comment nodes are generated
The internal encoding is ISO-8859-1
The standalone declaration is checked
Element positions are stored
The IDREF pass is left out
If available, DFAs are used for validation
Only deterministic content models are accepted
Namespace processing is turned off
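Since default_config is an ordinary record, individual options are normally changed with OCaml's record update syntax. The fragment below is a sketch that assumes the PXP library is installed; it uses only field names from the config record above, and the name `my_config` is hypothetical.

```ocaml
open Pxp_types

(* Start from the recommended defaults and override two options:
   keep strings internally as UTF-8 instead of ISO-8859-1, and
   represent comments as T_comment nodes instead of ignoring them. *)
let my_config =
  { default_config with
    encoding = `Enc_utf8;
    enable_comment_nodes = true;
  }
```

All fields not mentioned keep the values listed above.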

val default_namespace_config : config

Deprecated. Same as default_config, but namespace processing is turned on. Note however, that a globally defined namespace manager is used. Because of this, this config should no longer be used. Instead, do

         let m = Pxp_dtd.create_namespace_manager() in
         let namespace_config =
               { default_config with
                    enable_namespace_processing = Some m
               }

and take control of the scope of m.

Sources

Sources specify where the XML text to parse comes from. The type source is often not used directly, but sources are constructed with the help of the functions from_channel, from_obj_channel, from_file, and from_string (see below). Note that you can usually view the type source as an opaque type. There is no need to understand why it enumerates these three cases, or to use them directly. Just create sources with one of the from_* functions.

The type source is an abstraction on top of resolver (defined in module Pxp_reader). The resolver is a configurable object that knows how to access files that are

identified by an XML ID (a PUBLIC or SYSTEM name)
named relative to another file
referred to by the special PXP IDs Private and Anonymous.

Furthermore, the resolver knows a lot about the character encoding of the files. See Pxp_reader for details.

A source is a resolver that is applied to a certain ID that should be initially opened.

type source = Pxp_dtd.source =

`\|`	`Entity of ((Pxp_dtd.dtd -> Pxp_entity.entity) * Pxp_reader.resolver)`
`\|`	`ExtID of (Pxp_core_types.I.ext_id * Pxp_reader.resolver)`
`\|`	`XExtID of (Pxp_core_types.I.ext_id * string option * Pxp_reader.resolver)`

The three basic flavours of sources:

Entity(m,r) is a very low-level way of denoting a source. After the parser has created the DTD object d, it calls
e = m d
and uses the entity object e together with the resolver r. This kind of source is intended to implement customized versions of the entity classes. Use it only if there is a strong need to do so.
ExtID(xid,r) is the normal way of denoting a source. The external entity referred to by the ID xid is opened by using the resolver r.
XExtID(xid,sys_base,r) is an extension of ExtID. The additional parameter sys_base is the base URI to assume if xid is a relative URI (i.e. a SYSTEM ID).

val from_channel : ?alt:Pxp_reader.resolver list ->
       ?system_id:string ->
       ?fixenc:encoding ->
       ?id:ext_id ->
       ?system_encoding:encoding -> Pervasives.in_channel -> source

This function creates a source that reads the XML text from the passed in_channel. By default, this source is not able to read XML text from any other location (you cannot read from files etc.). The optional arguments allow you to modify this behaviour.

Keep the following in mind:

Because this source reads from a channel, it can only be used once.
The channel will be closed by the parser when the end of the channel is reached, or when the parser stops because of another reason.
Unless the alt argument specifies something else, you cannot refer to entities by SYSTEM or PUBLIC names (error "no input method available")
Even if you pass an alt method that can handle SYSTEM, it is not immediately possible to open SYSTEM entities that are defined by a URL relative to the entity that is accessed over the in_channel. You first must pass the system_id argument, so the parser knows the base name relative to which other SYSTEM entities can be resolved.
For more on how to construct sources and resolvers, see Intro_resolution.

Arguments:

alt: A list of further resolvers that are used to open further entities referenced in the initially opened entity. For example, you can pass new Pxp_reader.resolve_as_file() to enable resolving of file names found in SYSTEM IDs.
system_id: By default, the XML text found in the in_channel does not have any visible ID (to be exact, the in_channel has a private ID, but this is hidden). Because of this, it is not possible to open a second file by using a relative SYSTEM ID. The parameter system_id assigns the channel a SYSTEM ID that is only used to resolve further relative SYSTEM IDs. This parameter must be a UTF-8-encoded string.
fixenc: By default, the character encoding of the XML text is determined by looking at the XML declaration. Setting fixenc forces a certain character encoding. Useful if you can assume that the XML text has been recoded by the transmission media.

Deprecated arguments:

id: This parameter assigns the channel an arbitrary ID (like system_id, but PUBLIC, anonymous, and private IDs are also possible - although not reasonable). Furthermore, setting id also enables resolving of file names. id has higher precedence than system_id.
system_encoding: (Only useful together with id.) The character encoding used for file names. (UTF-8 by default.)

val from_obj_channel : ?alt:Pxp_reader.resolver list ->
       ?system_id:string ->
       ?fixenc:encoding ->
       ?id:ext_id ->
       ?system_encoding:encoding -> Netchannels.in_obj_channel -> source

Similar to from_channel, but reads from an Ocamlnet netchannel instead.

val from_string : ?alt:Pxp_reader.resolver list ->
       ?system_id:string -> ?fixenc:encoding -> string -> source

Similar to from_channel, but reads from a string.

Of course, it is possible to parse this source several times, unlike the channel-based sources.

val from_file : ?alt:Pxp_reader.resolver list ->
       ?system_encoding:encoding -> ?enc:encoding -> string -> source

This source reads initially from the file whose name is passed as string argument. The filename must be UTF-8-encoded (so it can be correctly rewritten into a URL).

This source can open further files by default, and relative URLs work immediately.

Arguments:

alt: A list of further resolvers, especially useful to open non-SYSTEM IDs, and non-file entities.
system_encoding: The character encoding the system uses to represent filenames. By default, UTF-8 is assumed.
enc: The character encoding of the string argument. As mentioned, this is UTF-8 by default.

Examples.

The source

         from_file "/tmp/file.xml"

reads from this file, which is assumed to have the ID SYSTEM "file://localhost/tmp/file.xml". Other files can be included either by an absolute SYSTEM file name or by a relative SYSTEM name.

The source

         let ch = open_in "/tmp/file.xml" in
         from_channel
           ~alt:[ new Pxp_reader.resolve_as_file() ]
           ~system_id:"file://localhost/tmp/file.xml"
           ch

does roughly the same, but uses a channel for the initially opened entity. Because of the alt argument, it is possible to reference other entities by absolute SYSTEM name. The system_id assignment makes SYSTEM names relative to the initially used entity resolvable.

The source

         let cat =
           new Pxp_reader.lookup_id
             [ Public("My Public ID",""), "/usr/share/xml/public.xml" ] in
         from_file ~alt:[cat] "/tmp/file.xml"

maps the PUBLIC ID "My Public ID" to the shown file, i.e. this file is parsed when this PUBLIC ID occurs in the XML text. (PUBLIC names cannot be resolved unless they are mapped.)

val open_source : config ->
       source ->
       bool -> Pxp_dtd.dtd -> Pxp_reader.resolver * Pxp_entity.entity

Returns the resolver and the entity for a source. The boolean arg determines whether a document entity (true) or a normal external entity (false) will be returned.

Entities

See Pxp_dtd.Entity for functions dealing with entities.

type entity_id = Pxp_lexer_types.entity_id

An entity_id is an identifier for an entity, or a fake identifier.

type entity = Pxp_entity.entity

The representation of entities

Event parsing

type entry = [ `Entry_content of [ `Dummy ] list
       | `Entry_declarations of [ `Extend_dtd_fully | `Val_mode_dtd ] list
       | `Entry_document of
           [ `Extend_dtd_fully | `Parse_xml_decl | `Val_mode_dtd ] list
       | `Entry_element_content of [ `Dummy ] list
       | `Entry_expr of [ `Dummy ] list ]

Entry points for the parser (used to call process_entity):

`Entry_document: The parser reads a complete document that must have a DOCTYPE and may have a DTD.
`Entry_declarations: The parser reads the external subset of a DTD
`Entry_element_content: The parser reads an entity containing contents, but there must be one top element, i.e. "misc* element misc*". At the beginning, there can be an XML declaration as for external entities.
`Entry_content: The parser reads an entity containing contents, but without the restriction of having a top element. At the beginning, there can be an XML declaration as for external entities.
`Entry_expr: The parser reads a single element, a single processing instruction or a single comment, or whitespace, whatever is found. In contrast to the other entry points, the expression need not be a complete entity, but can start and end in the middle of an entity.

More entry points might be defined in the future.

The entry points have a list of flags. Note that `Dummy is ignored and only present because O'Caml does not allow empty variants. For `Entry_document, and `Entry_declarations, the flags determine the kind of DTD object that is generated.

Without flags, the DTD object is configured for well-formedness mode:

Elements, attributes, and notations found in the XML text are not added to the DTD; entity declarations are added, however. Additionally, the DTD is configured such that it does not complain about missing elements, attributes, and notations (dtd#arbitrary_allowed).

The flags affecting the DTD have the following meaning. Keep in mind that the event parser can only conduct some validation checks because it does not represent the XML nodes as tree.

`Extend_dtd_fully: Elements, attributes, and notations are added to the DTD. The DTD mode dtd#arbitrary_allowed is enabled. If the resulting event stream is validated later, this mode has the effect that the actually declared elements, attributes, and notations are validated as declared. Also, non-declared elements, attributes, and notations are not rejected, but handled as in well-formed mode.
`Val_mode_dtd: The DTD object is set up for validation, i.e. all declarations are added to the DTD, and dtd#arbitrary_allowed is disabled. Furthermore, some validation checks are already done for the DTD (e.g. whether the root element is declared). If the resulting event stream is validated later, all validation checks are conducted (except for the XML declaration - see the next flag - this check must be separately enabled).
`Parse_xml_decl: By default, the XML declaration <?xml version="1.0" encoding="..." standalone="..."?> is ignored except for the encoding attribute. This flag causes the XML declaration to be parsed completely.

type event =
  | E_start_doc of (string * Pxp_dtd.dtd)
      (* Starts a document. The string is the XML version ("1.0") *)
  | E_end_doc of string
      (* Ends a document. The string is the literal name of the root element (without any normalization or transformation) *)
  | E_start_tag of (string * (string * string) list * Pxp_dtd.namespace_scope option * Pxp_lexer_types.entity_id)
      (* (name, attlist, scope_opt, entid): Starts an element name with an attribute list attlist. scope_opt is the scope object in namespace mode, otherwise None. entid identifies the entity where the start tag occurs *)
  | E_end_tag of (string * Pxp_lexer_types.entity_id)
      (* (name,entid): Ends the element name in entity entid. *)
  | E_char_data of string
      (* Character data *)
  | E_pinstr of (string * string * Pxp_lexer_types.entity_id)
      (* A processing instruction <?target value?> *)
  | E_comment of string
      (* A comment node. The string does not include the delimiters *)
  | E_start_super
      (* Starts the super root *)
  | E_end_super
      (* Ends the super root *)
  | E_position of (string * int * int)
      (* (entity,line,pos): Describes that the next event, which is either E_start_tag, E_pinstr, or E_comment, is located in entity at line and character position pos. *)
  | E_error of exn
      (* May occur as last event in a stream to describe an error *)
  | E_end_of_stream
      (* If the text can be parsed without error, this event is the last event of the stream *)

The type of XML events. In event mode, the parser emits a stream of these events. The parser already checks that certain structural properties are met:

Start and end tags (including those of the super root) are properly nested
Start and end tags of elements are in the same entity

If a whole document is parsed (entry `Entry_document), the events of the text are surrounded by E_start_doc and E_end_doc, i.e. the overall structure is:

E_start_doc
Now the elements (or the super root)
E_end_doc
E_error or E_end_of_stream

For the entries `Entry_content and `Entry_expr the document events are left out. The final E_error or E_end_of_stream event is nevertheless emitted.
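To illustrate what "properly nested" means for a consumer of the event stream, here is a small checker over a simplified event type. It is only a sketch: `event_sketch` is a local, reduced stand-in for the real event type (it drops attributes, scopes and entity IDs), and `well_nested` is a hypothetical helper. The parser already guarantees this property, so real code does not need to re-check it.

```ocaml
(* Reduced stand-in for the event type: only what the check needs. *)
type event_sketch =
  | E_start_tag of string
  | E_end_tag of string
  | E_char_data of string
  | E_end_of_stream

(* True if every end tag closes the most recently opened start tag and
   no element is left open when the events run out. *)
let well_nested events =
  let rec go stack = function
    | [] -> stack = []
    | E_start_tag name :: rest -> go (name :: stack) rest
    | E_end_tag name :: rest ->
        (match stack with
         | top :: below when top = name -> go below rest
         | _ -> false)
    | (E_char_data _ | E_end_of_stream) :: rest -> go stack rest
  in
  go [] events
```

The stack of open element names is the same bookkeeping a consumer would use to rebuild a tree from the stream.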

This web site is published by Informatikbüro Gerd Stolpmann
