
Module Netencoding.Html


module Html: sig .. end


Encodes characters that need protection by converting them to entity references. E.g. "<" is converted to "&lt;". As the entities may be named, there is a dependency on the character set.

Legacy functions:
val encode_from_latin1 : string -> string
val decode_to_latin1 : string -> string

The following functions have a more general interface and should be preferred in new programs.
val unsafe_chars_html4 : string
The string contains '<', '>', '"', '&' and the control characters 0-8, 11-12, 14-31, 127.
val encode : in_enc:Netconversion.encoding ->
?out_enc:Netconversion.encoding ->
?prefer_name:bool -> ?unsafe_chars:string -> unit -> string -> string
The input string, which is encoded as in_enc, is recoded to out_enc, and the following characters are encoded as HTML entities (&name; or &#num;):
  • The ASCII characters contained in unsafe_chars
  • The characters that cannot be represented in out_enc. By default (out_enc=`Enc_usascii), only ASCII characters can be represented, and thus all code points >= 128 are encoded as HTML entities. If you pass out_enc=`Enc_utf8, all characters can be represented.
For example, the string "(a<b) & (c>d)" is encoded as "(a&lt;b) &amp; (c&gt;d)".

It is required that out_enc is an ASCII-compatible encoding.
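
The effect of out_enc described above can be sketched as follows. This is only an illustration, assuming the program is linked against Ocamlnet's netstring library (for example via ocamlfind with -package netstring); the sample strings are made up for the example.

```ocaml
(* Default out_enc (`Enc_usascii): unsafe characters and all code points
   >= 128 are turned into entities. *)
let ascii_out =
  Netencoding.Html.encode ~in_enc:`Enc_utf8 () "(a<b) & (c>d)"

(* The same call on a string with a non-ASCII character (e with acute
   accent, written as UTF-8 bytes). The result contains only ASCII bytes. *)
let ascii_accent =
  Netencoding.Html.encode ~in_enc:`Enc_utf8 () "caf\xc3\xa9"

(* With out_enc=`Enc_utf8 every character is representable, so only the
   characters listed in unsafe_chars are replaced. *)
let utf8_out =
  Netencoding.Html.encode ~in_enc:`Enc_utf8 ~out_enc:`Enc_utf8 () "caf\xc3\xa9 <3"

let () =
  List.iter print_endline [ ascii_out; ascii_accent; utf8_out ]
```

The first result is the documented "(a&lt;b) &amp; (c&gt;d)"; in the last one the accented character is passed through unchanged and only "<" is replaced.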

The option prefer_name selects whether named entities (e.g. &lt;) or numeric entities (e.g. &#60;) are preferred.
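
A minimal sketch of the two settings of prefer_name, again assuming the netstring library is available. Both results decode back to the same string, which the last lines check with decode.

```ocaml
(* Named entities are requested explicitly here. *)
let named =
  Netencoding.Html.encode ~in_enc:`Enc_utf8 ~prefer_name:true () "a<b"

(* Numeric entities (of the form &#num;) are requested instead. *)
let numeric =
  Netencoding.Html.encode ~in_enc:`Enc_utf8 ~prefer_name:false () "a<b"

(* Either form is resolved to the same text by decode. *)
let back =
  Netencoding.Html.decode ~in_enc:`Enc_utf8 ~out_enc:`Enc_utf8 ()

let () =
  Printf.printf "%s\n%s\n" named numeric;
  assert (back named = back numeric)
```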

The efficiency of the function can be improved when the same encoding is applied to several strings. Create a specialized encoding function by passing all arguments up to the unit argument, and apply this function several times. For example:

     let my_enc = encode ~in_enc:`Enc_utf8 () in
     let s1' = my_enc s1 in
     let s2' = my_enc s2 in ...
 

type entity_set = [ `Empty | `Html | `Xml ] 
val decode : in_enc:Netconversion.encoding ->
out_enc:Netconversion.encoding ->
?lookup:(string -> string) ->
?subst:(int -> string) ->
?entity_base:entity_set -> unit -> string -> string
The input string is recoded from in_enc to out_enc, and HTML entities (&name; or &#num;) are resolved. The input encoding in_enc must be ASCII-compatible.
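
As a rough illustration of the basic behaviour (assuming the netstring library is linked in), the following resolves both named and numeric references. As with encode, the arguments up to the unit argument can be applied once and the resulting function reused.

```ocaml
(* Specialized decoder: UTF-8 in, UTF-8 out, default entity handling. *)
let decode_utf8 =
  Netencoding.Html.decode ~in_enc:`Enc_utf8 ~out_enc:`Enc_utf8 ()

(* Named and numeric references are both resolved. *)
let plain = decode_utf8 "(a&lt;b) &amp; (c&gt;d) &#60;"

(* A named HTML 4 entity outside ASCII is emitted in the output encoding,
   here as the two UTF-8 bytes of e with acute accent. *)
let accent = decode_utf8 "caf&eacute;"

let () =
  print_endline plain;
  print_endline accent
```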

By default, the function knows all entities defined for HTML 4 (this can be changed using entity_base, see below). If other entities occur, the function lookup is called and the name of the entity is passed as input string to the function. It is expected that lookup returns the value of the entity, and that this value is already encoded as out_enc. By default, lookup raises a Failure exception.
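
A custom lookup function might look like the sketch below. The entity name "company" and its replacement text are hypothetical, chosen only for the example; the fallback case keeps unknown references in the output instead of raising Failure.

```ocaml
(* Hypothetical application-specific entities. The returned strings must
   already be encoded in out_enc (plain ASCII here, so valid UTF-8). *)
let my_lookup = function
  | "company" -> "Example Corp"
  | name -> "&" ^ name ^ ";"   (* leave unknown references as they are *)

let decode_custom =
  Netencoding.Html.decode
    ~in_enc:`Enc_utf8 ~out_enc:`Enc_utf8 ~lookup:my_lookup ()

(* "lt" is a known HTML 4 entity and never reaches my_lookup;
   "company" and "unknown" do. *)
let result = decode_custom "&company; &lt; &unknown;"

let () = print_endline result
```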

If a character cannot be represented in the output encoding, the function subst is called. subst must return a substitute string for the character. By default, subst raises a Failure exception.
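
The role of subst can be sketched by decoding to US-ASCII, where most entity values cannot be represented. The bracketed placeholder format is an arbitrary choice for this example, and it assumes that the integer passed to subst is the code point of the character.

```ocaml
(* Replace every unrepresentable character by a visible placeholder
   instead of raising Failure. *)
let placeholder code = Printf.sprintf "[U+%04X]" code

let decode_to_ascii =
  Netencoding.Html.decode
    ~in_enc:`Enc_utf8 ~out_enc:`Enc_usascii ~subst:placeholder ()

(* &eacute; denotes code point 233, which US-ASCII cannot represent. *)
let result = decode_to_ascii "caf&eacute;"

let () = print_endline result
```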

The option entity_base determines which set of entities is considered known, i.e. can be decoded without help from the lookup function: `Html selects all entities defined for HTML 4, `Xml selects only &lt;, &gt;, &amp;, &quot;, and &apos;, and `Empty selects the empty set (i.e. lookup is always called).
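
To make the difference between the three sets visible, the sketch below passes a lookup function that merely brackets the entity name, so every entity that is not known to the selected set shows up in brackets. The helper names are made up for the example.

```ocaml
(* Marks every entity name that decode hands over to lookup. *)
let mark name = "[" ^ name ^ "]"

(* Builds a decoder for the given entity set. *)
let decoder base =
  Netencoding.Html.decode
    ~in_enc:`Enc_utf8 ~out_enc:`Enc_utf8 ~lookup:mark ~entity_base:base ()

let input = "&lt; &eacute;"

let with_html  = decoder `Html input    (* both entities are known *)
let with_xml   = decoder `Xml input     (* only lt is known *)
let with_empty = decoder `Empty input   (* nothing is known *)

let () =
  List.iter print_endline [ with_html; with_xml; with_empty ]
```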

This web site is published by Informatikbüro Gerd Stolpmann