If an element declaration does not allow the element to contain character data, the following rules apply.
If the element must be empty, i.e. it is declared with the keyword EMPTY, the element instance must be effectively empty (it must not even contain whitespace characters). The parser guarantees that an element declared as EMPTY never contains a data node, not even a data node that represents the empty string.
If the element declaration only permits other elements to occur within that element but not character data, it is still possible to insert whitespace characters between the subelements. The parser ignores these characters, too, and does not create data nodes for them.
Example. Consider the following element types:
<!ELEMENT x ( #PCDATA | z )* >
<!ELEMENT y ( z )* >
<!ELEMENT z EMPTY>
Only x may contain character data; the keyword #PCDATA indicates this. The other types are character-free.
The XML term
<x><z/> <z/></x>
will be internally represented by an element node for x with three subnodes: the first z element, a data node containing the space character, and the second z element. In contrast to this, the term
<y><z/> <z/></y>
is represented by an element node for y with only two subnodes, the two z elements. There is no data node for the space character because spaces are ignored in the character-free element y.
Parser option: By setting the parser option drop_ignorable_whitespace to false, the behaviour of the parser is changed such that even ignorable whitespace characters are represented by data nodes.
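The whitespace rule for character-free elements can be illustrated with a small toy model. This is only a sketch: the node type, the function character_free (a stand-in for the information the DTD provides), and drop_ignorable are hypothetical names invented for this illustration and are not part of the parser's interface.

```ocaml
(* Toy document tree; NOT the parser's real node classes. *)
type node =
  | Element of string * node list
  | Data of string

(* Hypothetical stand-in for the DTD of the example:
   y and z are character-free, x is not. *)
let character_free = function
  | "y" | "z" -> true
  | _ -> false

(* True if the string consists only of XML whitespace characters. *)
let is_whitespace s =
  let ok = ref true in
  String.iter
    (fun c ->
      if not (c = ' ' || c = '\t' || c = '\n' || c = '\r') then ok := false)
    s;
  !ok

(* Remove whitespace-only data nodes below character-free elements. *)
let rec drop_ignorable node =
  match node with
  | Data _ -> node
  | Element (name, children) ->
      let children =
        if character_free name then
          List.filter
            (function Data s -> not (is_whitespace s) | Element _ -> true)
            children
        else children
      in
      Element (name, List.map drop_ignorable children)

(* The two terms of the example, with the space kept as a data node. *)
let x = Element ("x", [Element ("z", []); Data " "; Element ("z", [])])
let y = Element ("y", [Element ("z", []); Data " "; Element ("z", [])])
```

In this model, drop_ignorable leaves x unchanged and reduces y to its two z subnodes. A real validating parser would additionally reject non-whitespace character data inside y; the sketch simply keeps it.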
The XML specification allows all Unicode characters in XML texts. This parser can be configured such that UTF-8 is used to represent the characters internally; however, the default character encoding is ISO-8859-1. (Currently, no other encodings are possible for the internal string representation; the type Pxp_types.rep_encoding enumerates the possible encodings. In principle, the parser could use any ASCII-compatible encoding, but lexical analyzers currently exist only for UTF-8 and ISO-8859-1. UTF-16, UCS-4, and other multibyte encodings that are not ASCII-compatible cannot be used as internal encodings unless major parts of the parser are rewritten, which is unlikely to happen.)
The internal encoding may be different from the external encoding (specified in the XML declaration <?xml ... encoding="..."?>); in this case the strings are automatically converted to the internal encoding.
If the internal encoding is ISO-8859-1, it is possible that there are characters that cannot be represented. In this case, the parser ignores such characters and prints a warning (to the collect_warning object that must be passed when the parser is called).
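As a rough sketch of this behaviour, the following function converts a list of Unicode code points to an ISO-8859-1 string and drops everything above 0xFF. The names to_latin1 and warn are assumptions made for the illustration; warn merely stands in for the warning collector and does not reproduce its real interface.

```ocaml
(* Convert Unicode code points to an ISO-8859-1 string.
   Code points outside 0..0xFF cannot be represented: they are
   dropped and reported through [warn] (a hypothetical callback). *)
let to_latin1 ~warn code_points =
  let buf = Buffer.create 16 in
  List.iter
    (fun cp ->
      if cp >= 0 && cp <= 0xFF then Buffer.add_char buf (Char.chr cp)
      else
        warn (Printf.sprintf "code point U+%04X cannot be represented" cp))
    code_points;
  Buffer.contents buf

(* Usage: "A", e-acute, the euro sign (not in ISO-8859-1), "B". *)
let warnings = ref []

let result =
  to_latin1
    ~warn:(fun msg -> warnings := msg :: !warnings)
    [0x41; 0xE9; 0x20AC; 0x42]
```

Here result contains three bytes (A, the byte 0xE9, and B), and exactly one warning is recorded for the euro sign.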
The XML specification allows lines to be separated by single LF characters, by CR LF character sequences, or by single CR characters. Internally, these separators are always converted to single LF characters.
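The conversion of line separators can be sketched as follows. The function name normalize_newlines is chosen for this illustration only; it is not taken from the parser.

```ocaml
(* Replace CR LF pairs and single CR characters by a single LF. *)
let normalize_newlines s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then
      match s.[i] with
      | '\r' ->
          Buffer.add_char buf '\n';
          (* Skip the LF of a CR LF pair so the pair yields one LF only. *)
          if i + 1 < n && s.[i + 1] = '\n' then go (i + 2) else go (i + 1)
      | c ->
          Buffer.add_char buf c;
          go (i + 1)
  in
  go 0;
  Buffer.contents buf
```

For example, the string "a\r\nb\rc\nd" becomes "a\nb\nc\nd".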
The parser guarantees that there are never two adjacent data nodes; if necessary, data material that would otherwise be represented by several nodes is collapsed into one node. Note that you can still create node trees with adjacent data nodes; however, the parser does not return such trees.
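If you build node trees yourself and want the same guarantee, adjacent data nodes can be collapsed in a separate pass. The following sketch uses a toy node type and a hypothetical function merge_data; neither is part of the parser's interface.

```ocaml
(* Toy document tree; NOT the parser's real node classes. *)
type node =
  | Element of string * node list
  | Data of string

(* Collapse runs of adjacent data nodes into a single data node,
   recursively for all subelements. *)
let rec merge_data nodes =
  match nodes with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | Element (name, children) :: rest ->
      Element (name, merge_data children) :: merge_data rest
  | (Data _ as d) :: rest -> d :: merge_data rest
  | [] -> []
```

Applied to the list [Data "a"; Data "b"; Element ("z", []); Data "c"; Data "d"; Data "e"], the function returns [Data "ab"; Element ("z", []); Data "cde"].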
Note that CDATA sections are not represented specially; their contents are added to the data material that is currently being collected for the next data node.
Entities are not represented within documents! If the parser finds an entity reference in the document content, the reference is immediately expanded, and the parser reads the expansion text instead of the reference.
As attribute values are composed of Unicode characters, too, the same problems with the character encoding arise as for character material. Attribute values are converted to the internal encoding, too; and if there are characters that cannot be represented, these are dropped, and a warning is printed.
Attribute values are normalized before they are returned by methods like attribute. First, any remaining entity references are expanded; if necessary, expansion is performed recursively. Second, newline characters (any of LF, CR LF, or CR characters) are converted to single space characters. Note that the latter action in particular is prescribed by the XML standard (but the character reference &#10; is not converted, so it is still possible to include line feeds in attribute values).
Processing instructions are parsed to some extent: The first word of the PI is called the target, and it is stored separated from the rest of the PI:
<?target rest?>
The exact location where a PI occurs is not represented (by default). The parser puts the PI into the object that represents the enclosing construct (an element, a DTD, or the whole document); that means you can find out which PIs occur in a certain element, in the DTD, or in the whole document, but you cannot look up the exact position within the construct.
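The split of a PI into target and rest can be sketched as follows. The function split_pi is a hypothetical helper for illustration, not the parser's own code; it takes the text between "<?" and "?>" and cuts it at the first run of whitespace.

```ocaml
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Split the text between "<?" and "?>" into (target, rest).
   The target is the first word; the whitespace that separates
   it from the rest is discarded. *)
let split_pi content =
  let n = String.length content in
  let rec find_space i =
    if i < n && not (is_space content.[i]) then find_space (i + 1) else i
  in
  let rec skip_space i =
    if i < n && is_space content.[i] then skip_space (i + 1) else i
  in
  let target_end = find_space 0 in
  let rest_start = skip_space target_end in
  ( String.sub content 0 target_end,
    String.sub content rest_start (n - rest_start) )
```

For the content "target rest" the function returns the pair ("target", "rest"); a PI without further text, such as "target", yields an empty rest.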
Parser option: If you require the exact location of PIs, it is possible to create extra nodes for them. This mode is controlled by the option enable_pinstr_nodes. The additional nodes have the node type T_pinstr target, and are created from special exemplars contained in the spec (see pxp_document.mli).
Normally, comments are not represented; they are dropped by default.
Parser option: However, if you require comments in the document tree, it is possible to create T_comment nodes for them. This mode is enabled by the option enable_comment_nodes. Comment nodes are created from special exemplars contained in the spec (see pxp_document.mli). You can access the contents of comments through the method comment.
The attributes xml:lang and xml:space are not supported specially; they are handled like any other attribute.
Note that the utility function strip_whitespace respects xml:space.