module ULB: sig .. end

This module provides the unicode_lexbuf record with access functions. In this record, the data is available in two forms: as an array of Unicode code points, ulb_chars, and as a string of encoded characters, ulb_rawbuf. Both buffers are synchronised by ulb_chars_pos: this array stores where every character of ulb_chars can be found in ulb_rawbuf.

type unicode_lexbuf = private {
  mutable ulb_encoding : Netconversion.encoding;
    (* The character encoding of ulb_rawbuf *)
  mutable ulb_encoding_start : int;
    (* The first character position to which ulb_encoding applies
       (the encoding of earlier positions is lost) *)
  mutable ulb_rawbuf : Bytes.t;
    (* The encoded string to analyse *)
  mutable ulb_rawbuf_len : int;
    (* The filled part of ulb_rawbuf *)
  mutable ulb_rawbuf_end : int;
    (* The analysed part of ulb_rawbuf. It always holds that
       ulb_rawbuf_end <= ulb_rawbuf_len. The analysed part may be
       shorter than the filled part because there is not enough
       space in ulb_chars, or because the filled part ends with an
       incomplete multi-byte character *)
  mutable ulb_rawbuf_const : bool;
    (* Whether ulb_rawbuf is considered constant. If true, it is
       never blitted. *)
  mutable ulb_chars : int array;
    (* The analysed part of ulb_rawbuf as an array of Unicode code
       points. Only the positions 0 to ulb_chars_len-1 of the array
       are filled. *)
  mutable ulb_chars_pos : int array;
    (* For every analysed character, this array stores the byte
       position where the character begins in ulb_rawbuf. In
       addition, the array contains at index ulb_chars_len the value
       of ulb_rawbuf_end. This array is one element longer than
       ulb_chars. *)
  mutable ulb_chars_len : int;
    (* The filled part of ulb_chars *)
  mutable ulb_eof : bool;
    (* Whether EOF has been seen *)
  mutable ulb_refill : Bytes.t -> int -> int -> int;
    (* The refill function *)
  mutable ulb_enc_change_hook : unicode_lexbuf -> unit;
    (* This function is called when the encoding changes *)
  mutable ulb_cursor : Bytes.t Netconversion.poly_cursor;
    (* Internally used by the implementation *)
}

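The relationship between ulb_rawbuf, ulb_chars and ulb_chars_pos can be illustrated with a small stand-alone model. The sketch below is not the library's implementation: decode_utf8 and seq_len are hypothetical helpers, the input is assumed to be complete, well-formed UTF-8, and only the standard library is used. It returns one array playing the role of ulb_chars and one playing the role of ulb_chars_pos, including the extra final entry that marks the end of the analysed bytes.

```ocaml
(* Illustrative model only, not the ULB implementation. *)

(* Length in bytes of the UTF-8 sequence that starts with lead byte [c].
   Assumes [c] really is a lead byte of well-formed UTF-8. *)
let seq_len c =
  if c < 0x80 then 1
  else if c < 0xE0 then 2
  else if c < 0xF0 then 3
  else 4

(* Returns (chars, chars_pos): the code points of [raw], and the byte
   offset at which each code point starts. chars_pos has one extra final
   entry holding the offset just past the last decoded byte. *)
let decode_utf8 (raw : string) : int array * int array =
  let n = String.length raw in
  let chars = ref [] and pos = ref [] in
  let i = ref 0 in
  while !i < n do
    let c = Char.code raw.[!i] in
    let len = seq_len c in
    (* Payload bits of the lead byte, then 6 bits per continuation byte. *)
    let cp = ref (if len = 1 then c else c land (0xFF lsr (len + 1))) in
    for k = 1 to len - 1 do
      cp := (!cp lsl 6) lor (Char.code raw.[!i + k] land 0x3F)
    done;
    chars := !cp :: !chars;
    pos := !i :: !pos;
    i := !i + len
  done;
  (Array.of_list (List.rev !chars), Array.of_list (List.rev (!i :: !pos)))

let () =
  (* "h", U+00E9 (two bytes), "l", "l", "o": 5 characters in 6 bytes. *)
  let chars, chars_pos = decode_utf8 "h\xc3\xa9llo" in
  assert (Array.length chars_pos = Array.length chars + 1)
```

In this model the byte range of character i is chars_pos.(i) up to, but not including, chars_pos.(i+1), which is why the positions array needs one more element than the characters array.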
val from_function : ?raw_size:int ->
?char_size:int ->
?enc_change_hook:(unicode_lexbuf -> unit) ->
refill:(Bytes.t -> int -> int -> int) ->
Netconversion.encoding -> unicode_lexbuf

Creates a unicode_lexbuf to analyse strings of the passed encoding coming from the refill function.

- raw_size: The initial size for ulb_rawbuf. Defaults to 512.
- char_size: The initial size for ulb_chars. Defaults to 256.
- enc_change_hook: This function is called when the encoding is changed, either by this module or by the user calling set_encoding.
- refill: This function is called with the arguments ulb_rawbuf, ulb_rawbuf_len, and l, where l = Bytes.length ulb_rawbuf - ulb_rawbuf_len is the free space in the buffer. The function should write new bytes into this region and return the number of bytes added. A return value of 0 signals EOF.

val from_in_obj_channel : ?raw_size:int ->
?char_size:int ->
?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding ->
Netchannels.in_obj_channel -> unicode_lexbuf

Creates a unicode_lexbuf to analyse strings of the passed encoding coming from the object channel.

- raw_size: The initial size for ulb_rawbuf. Defaults to 512.
- char_size: The initial size for ulb_chars. Defaults to 256.
- enc_change_hook: This function is called when the encoding is changed, either by this module or by the user calling set_encoding.

val from_string : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> string -> unicode_lexbuf

Creates a unicode_lexbuf analysing the passed string encoded in the passed encoding. This function copies the input string.

- enc_change_hook: This function is called when the encoding is changed, either by this module or by the user calling set_encoding.

val from_bytes : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> Bytes.t -> unicode_lexbuf
val from_bytes_inplace : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> Bytes.t -> unicode_lexbuf

Creates a unicode_lexbuf analysing the passed byte string encoded in the passed encoding. This function does not copy the input, but uses it directly as ulb_rawbuf. The buffer is not modified by ULB, but the caller must ensure that other program parts do not modify it either.

- enc_change_hook: This function is called when the encoding is changed, either by this module or by the user calling set_encoding.

Note on from_string_inplace: this function has been removed because strings are now considered immutable.

val delete : int -> unicode_lexbuf -> unit

Deletes the given number of characters from the unicode_lexbuf. These characters are removed from the beginning of the buffer, i.e. with n being the int argument, ulb_chars.(n) becomes the new first character of the buffer. All three buffers ulb_rawbuf, ulb_chars, and ulb_chars_pos are blitted as necessary.

When the buffer is already at EOF, the function fails.

For efficiency, call delete as seldom as possible. Its running time is linear in the number of characters to move.

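The blitting described for delete can be sketched on a simplified model of the three buffers. Everything here is an assumption made for illustration: the record model, its shortened field names and delete_model are hypothetical, and the EOF check, ulb_rawbuf_end and the encoding bookkeeping of the real record are left out.

```ocaml
(* Simplified stand-in for the three buffers; not the library's record. *)
type model = {
  raw : Bytes.t;             (* plays the role of ulb_rawbuf *)
  mutable raw_len : int;     (* filled part of raw *)
  chars : int array;         (* plays the role of ulb_chars *)
  chars_pos : int array;     (* chars_len + 1 valid entries *)
  mutable chars_len : int;   (* filled part of chars *)
}

(* Remove the first [n] characters so that chars.(n) becomes chars.(0). *)
let delete_model n m =
  if n < 0 || n > m.chars_len then invalid_arg "delete_model";
  let byte_off = m.chars_pos.(n) in   (* first byte that is kept *)
  let kept = m.chars_len - n in
  (* Move the remaining raw bytes to the front (blit handles overlap). *)
  Bytes.blit m.raw byte_off m.raw 0 (m.raw_len - byte_off);
  m.raw_len <- m.raw_len - byte_off;
  (* Move the remaining code points to the front. *)
  Array.blit m.chars n m.chars 0 kept;
  (* Move the positions and rebase them to the new start of raw.
     kept + 1 entries are copied because of the extra final entry. *)
  for i = 0 to kept do
    m.chars_pos.(i) <- m.chars_pos.(i + n) - byte_off
  done;
  m.chars_len <- kept
```

Every remaining byte, code point and position is moved once, which is consistent with the linear cost stated above and with the advice to batch deletions.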
val refill : unicode_lexbuf -> unit

Tries to add characters to the unicode_lexbuf by calling the ulb_refill function. When the buffer is already at EOF, the exception End_of_file is raised, and the buffer is not modified. Otherwise, the ulb_refill function is called to add new characters. If necessary, ulb_rawbuf, ulb_chars, and ulb_chars_pos are enlarged so that either at least one new character is added, or EOF is found for the first time. In the latter case, ulb_eof is set to true (and the next call of refill will raise End_of_file).

val set_encoding : Netconversion.encoding -> unicode_lexbuf -> unit

Sets the encoding to the passed value. This only affects future refill calls. The hook enc_change_hook is invoked when defined.

val close : unicode_lexbuf -> unit

Sets ulb_eof of the unicode_lexbuf. The rest of the buffer is not modified.

val utf8_sub_string : int -> int -> unicode_lexbuf -> string

The two int arguments are the position and length of a substring of the lexbuf that is returned as a UTF-8 string. Position and length are given in characters, not in bytes.

val utf8_sub_string_length : int -> int -> unicode_lexbuf -> int

Returns String.length (utf8_sub_string args). Tries not to allocate the UTF-8 string.
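
One way to obtain the encoded length without building the string is to sum the UTF-8 width of each code point. The sketch below works on a plain int array of code points rather than on a unicode_lexbuf, and its function names (utf8_width, utf8_length_of_slice, utf8_of_slice) are hypothetical. It is not necessarily how ULB computes the value. Buffer.add_utf_8_uchar requires OCaml 4.06 or later.

```ocaml
(* Number of bytes needed to encode code point [cp] in UTF-8. *)
let utf8_width cp =
  if cp < 0x80 then 1
  else if cp < 0x800 then 2
  else if cp < 0x10000 then 3
  else 4

(* Encoded length of chars.(pos) .. chars.(pos + len - 1).
   [pos] and [len] count characters, not bytes. No string is built. *)
let utf8_length_of_slice chars pos len =
  if pos < 0 || len < 0 || pos + len > Array.length chars then
    invalid_arg "utf8_length_of_slice";
  let total = ref 0 in
  for i = pos to pos + len - 1 do
    total := !total + utf8_width chars.(i)
  done;
  !total

(* Allocating counterpart, shown only for comparison. Uchar.of_int
   raises Invalid_argument for values that are not Unicode scalars. *)
let utf8_of_slice chars pos len =
  let b = Buffer.create (max len 1) in
  for i = pos to pos + len - 1 do
    Buffer.add_utf_8_uchar b (Uchar.of_int chars.(i))
  done;
  Buffer.contents b

let () =
  (* "h", U+00E9, U+20AC, U+1F600 take 1, 2, 3 and 4 bytes. *)
  let chars = [| 0x68; 0xE9; 0x20AC; 0x1F600 |] in
  assert (utf8_length_of_slice chars 1 3
          = String.length (utf8_of_slice chars 1 3))
```

The counting version touches each code point once and allocates nothing, which is the property the description above asks for.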