1.2. Highlights of XML

This section introduces many, though not all, of the features of XML, and some of them only briefly. For a complete description, see the XML specification.

1.2.1. The DTD and the instance

The DTD contains various declarations; in general you can only use a feature if you have previously declared it. The document instance file may contain the full DTD, but it is also possible to split the DTD into an internal and an external subset. A document must begin as follows if the full DTD is included:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root [
  Declarations
]>
These declarations are called the internal subset. Note that the usage of entities and conditional sections is restricted within the internal subset.

If the declarations are located in a different file, you can refer to this file as follows:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root SYSTEM "file name">
The declarations in the file are called the external subset. The file name is called the system identifier. It is also possible to refer to the file by a so-called public identifier, but most XML applications won't use this feature.

You can also specify both an internal and an external subset. In this case, the declarations of both subsets are merged, and if there is a conflict, the declaration in the internal subset overrides the declaration of the same name in the external subset. This looks as follows:

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root  SYSTEM "file name" [
  Declarations
]>

The XML declaration (the string beginning with <?xml and ending at ?>) should specify the encoding of the file. Common values are UTF-8 and the ISO-8859 series of character sets. Note that every file parsed by the XML processor can begin with an XML declaration, and that every file may have its own encoding.
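
The effect of the encoding declaration can be sketched with Python's standard-library parser. This is only an illustration, assuming a one-element document invented for the purpose; the same character is stored once as ISO-8859-1 and once as UTF-8, and the parser decodes each according to its own XML declaration.

```python
import xml.etree.ElementTree as ET

# "\u00e4" is the character ä. The document itself is a made-up example.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>\u00e4</p>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>\u00e4</p>'.encode("utf-8")

# Each byte string is decoded according to the encoding named in its declaration,
# so both parse to the same character data.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
```

The two byte strings differ in length because ä occupies one byte in ISO-8859-1 and two bytes in UTF-8, yet the parsed text is identical.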

The name of the root element must be mentioned directly after the DOCTYPE string. This means that a full document instance looks like

<?xml version="1.0" encoding="Your encoding"?>
<!DOCTYPE root  SYSTEM "file name" [
  Declarations
]>

<root>
  inner contents
</root>
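
As a rough illustration, the following Python sketch reads such a document instance with the standard-library minidom module and inspects the DOCTYPE name and the root element. It assumes a document with an internal subset only, and the single ELEMENT declaration is a placeholder. Note that minidom is a non-validating parser: it reads the declarations but does not check the instance against them.

```python
from xml.dom import minidom

# A complete document instance with an internal subset (placeholder declaration).
document = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
  <!ELEMENT root (#PCDATA)>
]>
<root>
  inner contents
</root>"""

dom = minidom.parseString(document)

# The name after DOCTYPE and the type of the outermost element.
doctype_name = dom.doctype.name
root_name = dom.documentElement.tagName
inner_text = dom.documentElement.firstChild.data
```

A validating parser would additionally insist that doctype_name and root_name are equal.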

1.2.2. Reserved characters

Some characters are reserved to indicate markup and therefore cannot be used directly in character data. These characters are <, >, and &. Furthermore, single and double quotes are sometimes reserved. If you want to include such a character literally, write it as follows:

  • &lt; instead of <

  • &gt; instead of >

  • &amp; instead of &

  • &apos; instead of '

  • &quot; instead of "

All other characters can be used freely in the document instance. It is also possible to include a character by its Unicode code point:
&#n;
where n is the decimal number of the character. Alternatively, you can specify the character by its hexadecimal number:
&#xn;
In the scope of declarations, the character % is no longer free. To include it as a character, you must use the notation &#37; or &#x25;.

Note that besides &lt;, &gt;, &amp;, &apos;, and &quot; there are no predefined character entities. This is different from HTML, which defines a list of characters that can be referenced by name (e.g. &auml; for ä); however, if you prefer named characters, you can declare such entities yourself (see below).
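
The rules for reserved characters and character references can be tried out with a short Python sketch. It uses only the standard library; the sample strings are invented, and the auml entity is declared by hand in an internal subset because XML does not predefine it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Writing character data: escape() replaces &, < and > by their references.
escaped = escape("a < b & c > d")

# Reading character data: decimal and hexadecimal references to the same character.
decimal_and_hex = ET.fromstring("<p>&#228;&#xE4;</p>").text

# &auml; is not predefined in XML, so an undeclared reference is rejected ...
try:
    ET.fromstring("<p>&auml;</p>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

# ... but it works once the entity has been declared in the DTD.
declared = ET.fromstring('<!DOCTYPE p [<!ENTITY auml "&#228;">]><p>&auml;</p>').text
```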

1.2.3. Elements and ELEMENT declarations

Elements structure the document instance in a hierarchical way. There is a top-level element, the root element, which contains a sequence of inner elements and character sections. The inner elements are structured in the same way. Every element has an element type. The beginning of the element is indicated by a start tag, written

<element-type>
and the element continues until the corresponding end tag is reached:
</element-type>
In XML, it is not allowed to omit start or end tags, even if the DTD would permit this. Note that there are no special rules for interpreting spaces or newlines near start or end tags; all spaces and newlines count.
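
Both points can be observed with Python's standard-library parser. This is a small sketch with made-up element names a and b: whitespace next to tags is reported as character data, and a missing end tag is rejected.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<a><b></b></a>")
spaced = ET.fromstring("<a>\n  <b></b>\n</a>")

# Without whitespace there is no character data before <b> ...
text_before_b_compact = compact.text
# ... with whitespace, the newline and the indentation are part of the content.
text_before_b_spaced = spaced.text
text_after_b_spaced = spaced[0].tail

# Omitting the end tag of b is an error.
try:
    ET.fromstring("<a><b></a>")
    omitted_end_tag_accepted = True
except ET.ParseError:
    omitted_end_tag_accepted = False
```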

Every element type must be declared before it can be used. The declaration consists of two parts: the ELEMENT declaration describes the content model, i.e. which inner elements are allowed; the ATTLIST declaration describes the attributes of the element.

An element can simply allow everything as content. This is written:

<!ELEMENT name ANY>
Conversely, an element can be forced to be empty; this is declared by:
<!ELEMENT name EMPTY>
Note that there is an abbreviated notation for empty element instances: <name/>.

There are two more sophisticated forms of declarations: so-called mixed declarations, and regular expressions. An element with mixed content contains character data interspersed with inner elements, and the set of allowed inner elements can be specified. In contrast to this, a regular expression declaration does not allow character data, but the inner elements can be described by the more powerful means of regular expressions.

A declaration for mixed content looks as follows:

<!ELEMENT name (#PCDATA | element1 | ... | elementn )*>
or if you do not want to allow any inner element, simply
<!ELEMENT name (#PCDATA)>

Example

If element type q is declared as

<!ELEMENT q (#PCDATA | r | s)*>
this is a legal instance:
<q>This is character data<r></r>with <s></s>inner elements</q>
But this is illegal because t has not been enumerated in the declaration:
<q>This is character data<r></r>with <t></t>inner elements</q>
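
The check behind this example can be sketched in Python. The standard-library parser does not validate, so the helper undeclared_children below is a hypothetical, hand-written stand-in for what a validating parser does with a mixed declaration: it compares the child element types with the enumerated names.

```python
import xml.etree.ElementTree as ET

# The names enumerated in <!ELEMENT q (#PCDATA | r | s)*>
ALLOWED_IN_Q = {"r", "s"}

def undeclared_children(xml_text, allowed):
    """Return the child element types that the mixed declaration does not list."""
    element = ET.fromstring(xml_text)
    return sorted({child.tag for child in element} - allowed)

legal = "<q>This is character data<r></r>with <s></s>inner elements</q>"
illegal = "<q>This is character data<r></r>with <t></t>inner elements</q>"

legal_problems = undeclared_children(legal, ALLOWED_IN_Q)
illegal_problems = undeclared_children(illegal, ALLOWED_IN_Q)

# The character data is interspersed with the inner elements.
q = ET.fromstring(legal)
character_pieces = [q.text] + [child.tail for child in q]
```

Because the declaration ends in *, neither the order nor the number of the inner elements matters, which is why a simple set comparison is enough here.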

The other form uses a regular expression to describe the possible contents:

<!ELEMENT name regexp>
The following well-known regexp operators are allowed:

  • element-name

  • (subexpr1 , ... , subexprn )

  • (subexpr1 | ... | subexprn )

  • subexpr*

  • subexpr+

  • subexpr?

The , operator indicates a sequence of sub-models, the | operator describes alternative sub-models. The * indicates zero or more repetitions, and + one or more repetitions. Finally, ? can be used for optional sub-models. As atoms the regexp can contain names of elements; note that it is not allowed to include #PCDATA.

The exact syntax of the regular expressions is somewhat unusual. It is best explained by a list of constraints:

  • The outermost expression must not be element-name.

    Illegal: <!ELEMENT x y>; this must be written as <!ELEMENT x (y)>.

  • For the unary operators subexpr*, subexpr+, and subexpr?, the subexpr must not itself be a unary operator.

    Illegal: <!ELEMENT x y**>; this must be written as <!ELEMENT x (y*)*>.

  • Between ) and one of the unary operators *, +, or ?, there must not be whitespace.

    Illegal: <!ELEMENT x (y|z) *>; this must be written as <!ELEMENT x (y|z)*>.

  • There is the additional constraint that the right parenthesis must be contained in the same entity as the left parenthesis; see the section about parsed entities below.

Note that there is a further restriction: the regular expressions must be deterministic. This means that the parser must be able to tell, by looking only at the next token, which alternative is being used or whether a repetition stops. The only reason for this is compatibility with SGML; there is no intrinsic reason for the rule, and XML could live without it.

Example

The elements are declared as follows:

<!ELEMENT q (r?, (s | t)+)>
<!ELEMENT r (#PCDATA)>
<!ELEMENT s EMPTY>
<!ELEMENT t (q | r)>
This is a legal instance:
<q><r>Some characters</r><s/></q>
(Note: <s/> is an abbreviation for <s></s>.) It would be illegal to leave <s/> out because at least one instance of s or t must be present. It would also be illegal for characters to appear outside the r element; the only exception is white space. The following is legal, too:
<q><s/><t><q><s/></q></t></q>
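
To make the meaning of such content models concrete, here is a rough Python sketch that translates a model into a Python regular expression over the sequence of child element types. The helpers model_to_regex and children_match are hypothetical names, not part of any library. The sketch covers only the operators listed above; it does not check character data, and it does not test whether a model is deterministic.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model):
    """Translate a content model such as "(r?, (s | t)+)" into a Python regex."""
    # Every element name becomes a group matching "name;".
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # A sequence is plain concatenation in a regex, so commas and spaces vanish.
    return re.compile(pattern.replace(",", "").replace(" ", ""))

def children_match(element, model):
    """Check the child element types of element against the content model."""
    child_sequence = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(child_sequence) is not None

Q_MODEL = "(r?, (s | t)+)"
T_MODEL = "(q | r)"

first = ET.fromstring("<q><r>Some characters</r><s/></q>")
missing_s = ET.fromstring("<q><r>Some characters</r></q>")
nested = ET.fromstring("<q><s/><t><q><s/></q></t></q>")

first_ok = children_match(first, Q_MODEL)
missing_s_ok = children_match(missing_s, Q_MODEL)
nested_ok = children_match(nested, Q_MODEL) and children_match(nested[1], T_MODEL)
```

The semicolon after each name keeps names apart, so that a sequence such as (a, b) cannot be confused with a single element type called ab.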

1.2.4. Attribute lists and ATTLIST declarations

Elements may have attributes. These are put into the start tag of an element as follows:

<element-name attribute1="value1" ... attributen="valuen">
Instead of "valuek" it is also possible to use single quotes, as in 'valuek'. Note that you cannot use double quotes literally within the value of the attribute if double quotes are the delimiters; the same applies to single quotes. In general, you cannot use < and & as literal characters in attribute values. It is possible to include the entity references &lt;, &gt;, &amp;, &apos;, and &quot; (and any other reference to a general entity, as long as the entity is not defined by an external file) as well as character references such as &#n;.

Before you can use an attribute you must declare it. An ATTLIST declaration looks as follows:

<!ATTLIST element-name 
          attribute-name attribute-type attribute-default
          ...
          attribute-name attribute-type attribute-default
>
There are many attribute types, but the most important are:

  • CDATA: Every string is allowed as attribute value.

  • NMTOKEN: Every nametoken is allowed as attribute value. Nametokens consist (mainly) of letters, digits, ., :, -, _ in arbitrary order.

  • NMTOKENS: A space-separated list of nametokens is allowed as attribute value.

The most interesting default declarations are:

  • #REQUIRED: The attribute must be specified.

  • #IMPLIED: The attribute can be specified but also can be left out. The application can find out whether the attribute was present or not.

  • "value" or 'value': This particular value is used as default if the attribute is omitted in the element.

Example

This is a valid attribute declaration for element type r:

<!ATTLIST r 
          x CDATA    #REQUIRED
          y NMTOKEN  #IMPLIED
          z NMTOKENS "one two three">
This means that x is a required attribute that cannot be left out, while y and z are optional. The XML parser tells the application whether y is present or not, but if z is missing, the default value "one two three" is returned automatically.

This is a valid example of these attributes:

<r x="He said: &quot;I don't like quotes!&quot;" y='1'>
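
The behaviour of the defaults can be observed with Python's standard-library parser, which is based on Expat. The sketch assumes that the ATTLIST declaration sits in the internal subset, and the second instance (x="plain") is invented for the comparison. The parser is non-validating, so it supplies the default for z but would not complain about a missing x.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE r [
  <!ELEMENT r (#PCDATA)>
  <!ATTLIST r
            x CDATA    #REQUIRED
            y NMTOKEN  #IMPLIED
            z NMTOKENS "one two three">
]>
"""

with_y = ET.fromstring(DTD + """<r x="He said: &quot;I don't like quotes!&quot;" y='1'/>""")
without_y = ET.fromstring(DTD + '<r x="plain"/>')

# &quot; has been replaced by a literal double quote in the reported value.
x_value = with_y.get("x")
# y is #IMPLIED: the application can see that it was left out.
y_missing = without_y.get("y") is None
# z has a default: it is reported even though neither instance specifies it.
z_value = without_y.get("z")
```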

1.2.5. Parsed entities

Elements describe the logical structure of the document, while entities determine the physical structure. Entities are the pieces of text the parser operates on, mostly files and macros. Entities may be parsed, in which case the parser reads the text and interprets it as XML markup, or unparsed, which simply means that the data of the entity has a foreign format (e.g. a GIF icon).

If the parsed entity is going to be used as part of the DTD, it is called a parameter entity. You can declare a parameter entity with a fixed text as content by:

<!ENTITY % name "value">
Within the DTD, you can refer to this entity, i.e. read the text of the entity, by:
%name;
Such entities behave like macros, i.e. when they are referred to, the macro text is inserted and read instead of the original text.

Example

For example, you can declare two elements with the same content model by:

<!ENTITY % model "a | b | c">
<!ELEMENT x (%model;)>
<!ELEMENT y (%model;)>
If the contents of the entity are given as string constant, the entity is called an internal entity. It is also possible to name a file to be used as content (an external entity):
<!ENTITY % name SYSTEM "file name">
There are some restrictions for parameter entities:

  • If the internal parameter entity contains the first token of a declaration (i.e. <!), it must also contain the last token of the declaration, i.e. the >. This means that the entity either contains a whole number of complete declarations, or some text from the middle of one declaration.

    Illegal:

    <!ENTITY % e "(a | b | c)>">
    <!ELEMENT x %e;
    Because <! is contained in the main entity, and the corresponding > is contained in the entity e.
  • If the internal parameter entity contains a left parenthesis, it must also contain the corresponding right parenthesis.

    Illegal:

    <!ENTITY % e "(a | b | c">
    <!ELEMENT x %e;)>
    Because ( is contained in the entity e, and the corresponding ) is contained in the main entity.
  • When reading text from an entity, the parser automatically inserts one space character before the entity text and one space character after the entity text. However, this rule is not applied within the definition of another entity.

    Legal:

     
    <!ENTITY % suffix "gif"> 
    <!ENTITY iconfile 'icon.%suffix;'>
    Because %suffix; is referenced within the definition text for iconfile, no additional spaces are added.

    Illegal:

    <!ENTITY % suffix "test">
    <!ELEMENT x.%suffix; ANY>
    Because %suffix; is referenced outside the definition text of another entity, the parser replaces %suffix; by spacetestspace.

    Illegal:

    <!ENTITY % e "(a | b | c)">
    <!ELEMENT x %e;*>
    Because the space inserted after the entity text ends up between ) and *, which is illegal.
  • An external parameter entity must always consist of a whole number of complete declarations.

  • In the internal subset of the DTD, a reference to a parameter entity (internal or external) is only allowed at positions where a new declaration can start.
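
The macro-like behaviour and the space rule from the list above can be imitated with a toy expander in Python. This is only a sketch of the textual substitution, not a DTD parser; the function expand_parameter_entities and its in_entity_value flag are made up for this illustration.

```python
import re

# Matches a parameter entity reference such as %model;
PE_REFERENCE = re.compile(r"%([A-Za-z_:][\w.:\-]*);")

def expand_parameter_entities(text, entities, in_entity_value=False):
    """Replace %name; by the entity text, padded with one space on each side
    unless the reference occurs within the definition of another entity."""
    padding = "" if in_entity_value else " "
    return PE_REFERENCE.sub(
        lambda match: padding + entities[match.group(1)] + padding, text)

entities = {"model": "a | b | c", "suffix": "gif", "e": "(a | b | c)"}

# Harmless: the extra spaces inside the parentheses do not matter.
element_x = expand_parameter_entities("<!ELEMENT x (%model;)>", entities)

# Within an entity definition no spaces are added.
iconfile_value = expand_parameter_entities("icon.%suffix;", entities, in_entity_value=True)

# Harmful: a space now separates ) and *, which the syntax forbids.
broken = expand_parameter_entities("<!ELEMENT x %e;*>", entities)
```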

If the parsed entity is going to be used in the document instance, it is called a general entity. Such entities can be used as abbreviations for frequent phrases, or to include external files. Internal general entities are declared as follows:

<!ENTITY name "value">
External general entities are declared this way:
<!ENTITY name SYSTEM "file name">
References to general entities are written as:
&name;
The main difference between parameter and general entities is that the former are only recognized in the DTD and that the latter are only recognized in the document instance. As the DTD is parsed before the document, the parameter entities are expanded first; for example it is possible to use the content of a parameter entity as the name of a general entity: &#38;%name;; [1].

General entities must respect the element hierarchy. This means that there must be an end tag for every start tag in the entity value, and that end tags without corresponding start tags are not allowed.

Example

If the author of a document changes from time to time, it is worthwhile to set up a general entity containing the names of the authors. When the author changes, you only need to change the definition of the entity and do not need to check all occurrences of the authors' names:

<!ENTITY authors "Gerd Stolpmann">
In the document text, you can now refer to the author names by writing &authors;.

Illegal: The elements in the following two entity definitions do not nest properly:

<!ENTITY lengthy-tag "<section textcolor='white' background='graphic'>">
<!ENTITY nonsense    "<a></b>">
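
These examples can be tried with Python's standard-library parser. In this sketch the surrounding doc element and the helper parses are invented for the illustration, and the entities are declared in an internal subset. The parser reports the nesting problem when the entity is actually referenced.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

authors_doc = """<!DOCTYPE doc [
  <!ENTITY authors "Gerd Stolpmann">
]>
<doc>Written by &authors;.</doc>"""

# The start tag comes from the entity, the end tag from the document text.
lengthy_tag_doc = """<!DOCTYPE doc [
  <!ENTITY lengthy-tag "<section textcolor='white' background='graphic'>">
]>
<doc>&lengthy-tag;text</section></doc>"""

# The end tag inside the entity does not match its start tag.
nonsense_doc = """<!DOCTYPE doc [
  <!ENTITY nonsense "<a></b>">
]>
<doc>&nonsense;</doc>"""

authors_text = ET.fromstring(authors_doc).text
lengthy_tag_ok = parses(lengthy_tag_doc)
nonsense_ok = parses(nonsense_doc)
```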

Earlier in this introduction we explained that there are substitutes for reserved characters: &lt;, &gt;, &amp;, &apos;, and &quot;. These are simply predefined general entities; note that they are the only predefined entities. You may declare these entities again, as long as their meaning is unchanged.

1.2.6. Notations and unparsed entities

Unparsed entities have a foreign format and thus cannot be read by the XML parser. Unparsed entities are always external. The format of an unparsed entity must have been declared; such a format is called a notation. The entity can then be declared by referring to this notation. As unparsed entities do not contain XML text, it is not possible to include them directly into the document; you can only declare attributes such that names of unparsed entities are acceptable values.

As you can see, unparsed entities are too complicated to be of much practical use. It is almost always better to simply pass the name of the data file as a normal attribute value, and let the application recognize and process the foreign format.
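
For completeness, the following Python sketch shows the pieces involved, using the Expat bindings from the standard library. The notation gif, its system identifier, the entity logo, and the attribute icon are all made-up names. The parser only reports the declarations; interpreting the foreign data is left to the application.

```python
import xml.parsers.expat

# Hypothetical example: a notation, an unparsed entity using it, and an
# attribute of type ENTITY whose value is the name of that entity.
document = """<!DOCTYPE doc [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ATTLIST doc icon ENTITY #IMPLIED>
]>
<doc icon="logo"/>"""

notations = {}
unparsed_entities = {}
icon_values = []

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_entity(name, is_parameter_entity, value, base, system_id, public_id, notation_name):
    # Only unparsed entities carry a notation name.
    if notation_name is not None:
        unparsed_entities[name] = (system_id, notation_name)

def on_start_element(name, attributes):
    if "icon" in attributes:
        icon_values.append(attributes["icon"])

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity
parser.StartElementHandler = on_start_element
parser.Parse(document, True)
```

The application still has to look up icon_values[0] in unparsed_entities itself, which is about as much work as reading a file name from an ordinary attribute.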

Notes

[1]

This construct is only allowed within the definition of another entity; otherwise extra spaces would be added (as explained above). Such indirection is not recommended.

Complete example:

<!ENTITY % variant "a">      <!-- or "b" -->
<!ENTITY text-a "This is text A.">
<!ENTITY text-b "This is text B.">
<!ENTITY text "&#38;text-%variant;;">
You can now write &text; in the document instance, and depending on the value of variant either text-a or text-b is inserted.
This web site is published by Informatikbüro Gerd Stolpmann