
Chapter 4. Configuring and calling the parser

4.1. Overview

The following main functions invoke the parser (in module Pxp_yacc):

  • parse_document_entity: You want to parse a complete and closed document consisting of a DTD and the document body; the body is validated against the DTD. This mode is useful if you have a file of the form

    <!DOCTYPE root ... [ ... ] > <root> ... </root>

    and you can accept any DTD that is included in the file (e.g. because the file is under your control).

  • parse_wfdocument_entity: You want to parse a complete and closed document consisting of a DTD and the document body, but the body is not validated, only checked for well-formedness. This mode is preferred if validation takes too much time or if the DTD is missing.

  • parse_dtd_entity: You only want to parse an entity (file) containing the external subset of a DTD. Sometimes it is useful to read such a DTD, for example to compare it with the DTD included in a document, or to apply the next mode:

  • parse_content_entity: You only want to parse an entity (file) containing a fragment of a document body; this fragment is validated against the DTD you pass to the function. In particular, the fragment must not have a <!DOCTYPE> clause, and must begin directly with an element. The element is validated against the DTD. This mode is useful if you want to check documents against a fixed, immutable DTD.

  • parse_wfcontent_entity: This function also parses a fragment consisting of a single element, but without a DTD; the element is not validated, only checked for well-formedness.

  • extract_dtd_from_document_entity: This function extracts the DTD from a closed document consisting of a DTD and a document body. Both the internal and the external subsets are extracted.

In many cases, parse_document_entity is the preferred mode to parse a document in a validating way, and parse_wfdocument_entity is the mode of choice to parse a file while only checking for well-formedness.

There are a number of variations of these modes. One important application of a parser is to check documents from an untrusted source against a fixed DTD. One solution is to disallow the <!DOCTYPE> clause in these documents, and to treat the document like a fragment (using mode parse_content_entity). This is very simple, but inflexible; users of such a system cannot even define additional entities to abbreviate frequent phrases of their text.

A more intelligent checker may be necessary. For example, it is also possible to parse the document to be checked in full, i.e. together with its DTD, and then to compare this DTD with the prescribed one. To parse the document in full, mode parse_document_entity is applied; to obtain the DTD to compare with, mode parse_dtd_entity can be used.

There is another very important configurable aspect of the parser: the so-called resolver. The task of the resolver is to locate the contents of an (external) entity for a given entity name, and to make the contents accessible as a character stream. (Furthermore, it also normalizes the character set; but this is a detail we can ignore here.) Suppose you have a file called "main.xml" containing

    <!ENTITY % sub SYSTEM "sub/sub.xml">
    %sub;

and a file stored in the subdirectory "sub" with name "sub.xml" containing

    <!ENTITY % subsub SYSTEM "subsub/subsub.xml">
    %subsub;

and a file stored in the subdirectory "subsub" of "sub" with name "subsub.xml" (the contents of this file do not matter). Here, the resolver must keep track of the fact that the second entity, subsub, is located in the directory "sub/subsub". The difficulty is to interpret the system (file) names of entities relative to the entities containing them, even if the entities are deeply nested.
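The relative interpretation of nested system names can be illustrated with a small, self-contained O'Caml sketch. It is not the resolver shipped with the parser; the helper names dirname_of and resolve_relative are hypothetical, names are assumed to use "/" as separator, and "." or ".." components are deliberately not handled.

```ocaml
(* Hypothetical helpers, not part of PXP: resolve a SYSTEM name relative
   to the name of the entity that contains the reference. *)

(* Directory part of a slash-separated name, including the trailing
   slash; the empty string if the name has no directory part. *)
let dirname_of (name : string) : string =
  match String.rindex_opt name '/' with
  | Some i -> String.sub name 0 (i + 1)
  | None -> ""

(* Interpret [system_id] relative to [base], the name of the containing
   entity. Names starting with "/" are treated as absolute and returned
   unchanged. "." and ".." components are NOT normalized here. *)
let resolve_relative ~(base : string) (system_id : string) : string =
  if String.length system_id > 0 && system_id.[0] = '/' then system_id
  else dirname_of base ^ system_id

let () =
  (* main.xml refers to sub/sub.xml, which refers to subsub/subsub.xml *)
  let sub = resolve_relative ~base:"main.xml" "sub/sub.xml" in
  let subsub = resolve_relative ~base:sub "subsub/subsub.xml" in
  Printf.printf "%s\n%s\n" sub subsub
```

The point of the sketch is that each name is resolved against the entity that contains it, not against "main.xml", so the base must be carried along as entities are opened.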

There is no fixed resolver that already does everything right: resolving entity names is a task that highly depends on the environment. The XML specification only demands that SYSTEM entities are interpreted like URLs (which is not very precise, as there are lots of URL schemes in use), hoping that this helps to overcome the local peculiarities of the environment; the idea is that if you do not know your environment you can refer to other entities by denoting URLs for them. I think that this interpretation of SYSTEM names may have some applications on the Internet, but it is not the first choice in general. Because of this, the resolver is a separate module of the parser that can be exchanged for another one if necessary; more precisely, the parser already defines several resolvers.

The following resolvers do already exist:

  • Resolvers reading from arbitrary input channels. These can be configured such that a certain ID is associated with the channel; in this case inner references to external entities can be resolved. There is also a special resolver that interprets SYSTEM IDs as URLs; this resolver can process relative SYSTEM names and determine the corresponding absolute URL.

  • A resolver that always reads from a given O'Caml string. This resolver cannot resolve further names if the string is not associated with any name, i.e. if the document contained in the string refers to an external entity, this reference cannot be followed in this case.

  • A resolver for file names. The SYSTEM name is interpreted as a file URL with the slash "/" as separator for directories. This resolver is derived from the generic URL resolver.

The interface a resolver must have is documented, so it is possible to write your own resolver. For example, you could connect the parser with an HTTP client, and resolve URLs of the HTTP namespace. The resolver classes allow several independent resolvers to be combined into one more powerful resolver; thus it is possible to combine a self-written resolver with the already existing resolvers.
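The idea of combining resolvers can be sketched in a simplified model. The code below does not use the parser's resolver classes: as an assumption made only for illustration, a resolver is modeled as a function from a SYSTEM name to the entity contents, returning None when it is not responsible for that name. The names resolver, combine, memory_resolver and http_stub_resolver are hypothetical, and the HTTP variant is a stub that performs no network access.

```ocaml
(* Simplified model (an assumption, not the PXP class interface):
   a resolver returns the entity contents, or None if it cannot
   handle the given SYSTEM name. *)
type resolver = string -> string option

(* Try each resolver in order and return the first successful result. *)
let rec combine (resolvers : resolver list) : resolver =
  fun system_id ->
    match resolvers with
    | [] -> None
    | r :: rest ->
        (match r system_id with
         | Some _ as found -> found
         | None -> combine rest system_id)

(* Toy resolver backed by an in-memory table of known entities. *)
let memory_resolver : resolver =
  fun id -> List.assoc_opt id [ ("sub/sub.xml", "<!ENTITY x 'y'>") ]

(* Stub standing in for a self-written HTTP resolver: it only
   recognizes the prefix and returns a placeholder text. *)
let http_stub_resolver : resolver =
  fun id ->
    let prefix = "http://" in
    let n = String.length prefix in
    if String.length id >= n && String.sub id 0 n = prefix
    then Some ("<!-- would be fetched from " ^ id ^ " -->")
    else None

(* The combined resolver handles every name either part can handle. *)
let combined : resolver = combine [ memory_resolver; http_stub_resolver ]
```

In this model the order of the list decides which resolver wins when several could handle the same name; a name that no resolver accepts yields None.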

Note that the existing resolvers only interpret SYSTEM names, not PUBLIC names. If it helps you, it is possible to define resolvers for PUBLIC names, too; for example, such a resolver could look up the public name in a hash table, and map it to a system name which is passed on to the existing resolver for system names. It is relatively simple to provide such a resolver.
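The hash table lookup described above might look roughly like the following sketch. It only shows the mapping step, not a complete resolver class; the type external_id, the table catalog, the function system_name_of, and the catalog entry are all made up for illustration and are not taken from the parser's interface.

```ocaml
(* Local illustration type, defined here and not taken from PXP. *)
type external_id =
  | System of string            (* SYSTEM name only *)
  | Public of string * string   (* PUBLIC name, fallback SYSTEM name *)

(* Hypothetical catalog: PUBLIC name -> SYSTEM name. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 16

let () =
  (* Made-up entry for demonstration purposes. *)
  Hashtbl.replace catalog "-//EXAMPLE//DTD Sample 1.0//EN" "dtds/sample.dtd"

(* Compute the SYSTEM name to hand to a resolver for system names:
   a PUBLIC name found in the catalog takes precedence, otherwise the
   SYSTEM name given in the document is used. *)
let system_name_of (id : external_id) : string =
  match id with
  | System s -> s
  | Public (public_name, fallback) ->
      (match Hashtbl.find_opt catalog public_name with
       | Some s -> s
       | None -> fallback)
```

Falling back to the SYSTEM name given in the document is one possible design choice; a stricter checker could reject unknown PUBLIC names instead.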

This web site is published by Informatikbüro Gerd Stolpmann