module ULB: sig .. end
This module provides the unicode_lexbuf record with access functions. In this record, the data is available in two forms: as an array of Unicode code points, ulb_chars, and as a string of encoded characters, ulb_rawbuf. Both buffers are synchronised by ulb_chars_pos. This array stores where every character of ulb_chars can be found in ulb_rawbuf.
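To make the relationship between the two buffers concrete, here is a minimal, self-contained sketch that does not use ULB itself. It decodes a UTF-8 string into an array of code points and a parallel array of byte offsets, mirroring the roles of ulb_chars and ulb_chars_pos. The name decode_utf8 is hypothetical, the input is assumed to be valid UTF-8, and storing the end offset in one extra trailing slot is an assumption of this sketch.

```ocaml
(* Hypothetical helper, not part of ULB: decode a valid UTF-8 string into
   code points plus the byte offset at which each character starts. *)
let decode_utf8 (raw : string) : int array * int array =
  let n = String.length raw in
  let chars = ref [] and pos = ref [] in
  let i = ref 0 in
  while !i < n do
    let b0 = Char.code raw.[!i] in
    (* Length of the byte sequence, derived from the lead byte. *)
    let len =
      if b0 < 0x80 then 1
      else if b0 < 0xE0 then 2
      else if b0 < 0xF0 then 3
      else 4 in
    (* Payload bits of the lead byte. *)
    let cp = ref (if len = 1 then b0 else b0 land (0xFF lsr (len + 1))) in
    (* Each continuation byte contributes six more bits. *)
    for k = 1 to len - 1 do
      cp := (!cp lsl 6) lor (Char.code raw.[!i + k] land 0x3F)
    done;
    chars := !cp :: !chars;
    pos := !i :: !pos;
    i := !i + len
  done;
  (* Assumption: one extra slot records the end of the analysed bytes. *)
  (Array.of_list (List.rev !chars), Array.of_list (List.rev (n :: !pos)))
```

For the six-byte input "a\xC3\xA9\xE2\x82\xAC" (the characters a, é, €), the code points are 0x61, 0xE9, 0x20AC and the offsets are 0, 1, 3, followed by the end offset 6.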
type unicode_lexbuf = private {
  mutable ulb_encoding :
    (* The character encoding of *)
  mutable ulb_encoding_start :
    (* The first character position to which *)
  mutable ulb_rawbuf :
    (* The encoded string to analyse *)
  mutable ulb_rawbuf_len :
    (* The filled part of *)
  mutable ulb_rawbuf_end :
    (* The analysed part of *)
  mutable ulb_rawbuf_const :
    (* Whether *)
  mutable ulb_chars :
    (* The analysed part of *)
  mutable ulb_chars_pos :
    (* For every analysed character this array stores the
       byte position where the character begins in
       This array is one element longer than *)
  mutable ulb_chars_len :
    (* The filled part of *)
  mutable ulb_eof :
    (* Whether EOF has been seen *)
  mutable ulb_refill :
    (* The refill function *)
  mutable ulb_enc_change_hook :
    (* This function is called when the encoding changes *)
  mutable ulb_cursor :
    (* Internally used by the implementation *)
}
val from_function : ?raw_size:int ->
?char_size:int ->
?enc_change_hook:(unicode_lexbuf -> unit) ->
refill:(Stdlib.Bytes.t -> int -> int -> int) ->
Netconversion.encoding -> unicode_lexbuf
Creates a unicode_lexbuf to analyse strings of the passed encoding coming from the refill function.
raw_size: The initial size for ulb_rawbuf. Defaults to 512.
char_size: The initial size for ulb_chars. Defaults to 256.
enc_change_hook: This function is called when the encoding is changed, either by this module, or by the user calling set_encoding.
refill: This function is called with arguments ulb_rawbuf, ulb_rawbuf_len, and l, where l = Bytes.length ulb_rawbuf - ulb_rawbuf_len is the free space in the buffer. The function should write new bytes into this free region and return the number of bytes added. The return value 0 signals EOF.
val from_in_obj_channel : ?raw_size:int ->
?char_size:int ->
?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding ->
Netchannels.in_obj_channel -> unicode_lexbuf
Creates a unicode_lexbuf to analyse strings of the passed encoding coming from the object channel.
raw_size: The initial size for ulb_rawbuf. Defaults to 512.
char_size: The initial size for ulb_chars. Defaults to 256.
enc_change_hook: This function is called when the encoding is changed, either by this module, or by the user calling set_encoding.
val from_string : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> string -> unicode_lexbuf
Creates a unicode_lexbuf analysing the passed string encoded in the passed encoding. This function copies the input string.
enc_change_hook: This function is called when the encoding is changed, either by this module, or by the user calling set_encoding.
val from_bytes : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> Stdlib.Bytes.t -> unicode_lexbuf
Same as from_string, but for bytes.
val from_bytes_inplace : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> Stdlib.Bytes.t -> unicode_lexbuf
Creates a unicode_lexbuf analysing the passed byte sequence encoded in the passed encoding. This function does not copy the input, but uses it directly as ulb_rawbuf. The input is not modified by ULB, but the caller must ensure that other program parts do not modify it either.
enc_change_hook: This function is called when the encoding is changed, either by this module, or by the user calling set_encoding.
Regarding from_string_inplace, this function has been removed as strings are now considered immutable.
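The difference between copying the input and using it in place can be illustrated with plain Bytes values. The sketch below calls no ULB function. The hypothetical function copy_vs_share only contrasts a private copy with a shared reference after some other program part has mutated the input.

```ocaml
(* Illustration with Stdlib.Bytes only; no ULB function is called here. *)
let copy_vs_share () : string * string =
  let input = Bytes.of_string "abc" in
  let copied = Bytes.copy input in  (* what a copying constructor would keep *)
  let shared = input in             (* what an in-place constructor would keep *)
  Bytes.set input 0 'X';            (* another program part mutates the input *)
  (* The copy still reads "abc", the shared buffer now reads "Xbc". *)
  (Bytes.to_string copied, Bytes.to_string shared)
```

The shared buffer observes the mutation, which is why the caller of an in-place constructor has to keep the bytes unchanged for as long as the lexbuf is in use.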
val delete : int -> unicode_lexbuf -> unit
Deletes the first n characters from the unicode_lexbuf, where n is the int argument. These characters are removed from the beginning of the buffer, i.e. ulb_chars.(n) becomes the new first character of the buffer. All three buffers ulb_rawbuf, ulb_chars, and ulb_chars_pos are blitted as necessary.
When the buffer is already at EOF, the function fails.
For efficiency, delete should be called as seldom as possible. Its running time is linear in the number of characters that have to be moved.
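The blitting described above can be sketched on a simplified stand-in record. This is not the Netulex implementation: the type model, its field names, and delete_model are hypothetical, and the sketch assumes that the offsets array carries one trailing end offset. It also omits the EOF check and the handling of constant raw buffers.

```ocaml
(* A simplified stand-in for the lexbuf; all names are hypothetical. *)
type model = {
  raw : Bytes.t;            (* encoded bytes *)
  mutable raw_len : int;    (* filled part of raw *)
  chars : int array;        (* decoded code points *)
  chars_pos : int array;    (* start offset of each character, plus end offset *)
  mutable chars_len : int;  (* filled part of chars *)
}

(* Drop the first n characters by moving the remainder to the front. *)
let delete_model (n : int) (m : model) : unit =
  if n < 0 || n > m.chars_len then invalid_arg "delete_model";
  let off = m.chars_pos.(n) in  (* bytes occupied by the dropped characters *)
  Bytes.blit m.raw off m.raw 0 (m.raw_len - off);
  Array.blit m.chars n m.chars 0 (m.chars_len - n);
  (* Shift the offsets and rebase them; the loop includes the end offset. *)
  for i = 0 to m.chars_len - n do
    m.chars_pos.(i) <- m.chars_pos.(i + n) - off
  done;
  m.raw_len <- m.raw_len - off;
  m.chars_len <- m.chars_len - n
```

Every call moves all remaining bytes, code points, and offsets, so the cost grows with the amount of data left in the buffer. Deleting in a few large steps is therefore cheaper than deleting after every token.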
val refill : unicode_lexbuf -> unit
Tries to add characters to the unicode_lexbuf by calling the ulb_refill function. When the buffer is already at EOF, the exception End_of_file is raised, and the buffer is not modified. Otherwise, the ulb_refill function is called to add new characters. If necessary, ulb_rawbuf, ulb_chars, and ulb_chars_pos are enlarged so that either at least one new character is added, or EOF is found for the first time. In the latter case, ulb_eof is set to true (and the next call of refill will raise End_of_file).
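As an illustration of the refill contract described for from_function (write into the free region of the buffer, return the number of bytes added, return 0 at EOF), the following sketch builds such a callback from an in-memory string. The name refill_of_string is hypothetical, and only Stdlib functions are used.

```ocaml
(* Hypothetical helper: build a refill callback that serves bytes from a
   string. It writes at most len bytes into buf starting at pos and returns
   the number of bytes written; 0 means that the source is exhausted. *)
let refill_of_string (src : string) : Bytes.t -> int -> int -> int =
  let offset = ref 0 in  (* how much of src has been handed out so far *)
  fun buf pos len ->
    let n = min len (String.length src - !offset) in
    Bytes.blit_string src !offset buf pos n;
    offset := !offset + n;
    n
```

A function of this type matches the refill argument in the signature of from_function. Note that this sketch also returns 0 when it is called with len = 0, so it relies on the caller always offering some free space.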
val set_encoding : Netconversion.encoding -> unicode_lexbuf -> unit
Sets the encoding to the passed value. This only affects future refill calls. The hook enc_change_hook is invoked when defined.
val close : unicode_lexbuf -> unit
Sets the ulb_eof flag of the unicode_lexbuf. The rest of the buffer is not modified.
val utf8_sub_string : int -> int -> unicode_lexbuf -> string
The two int arguments are the position and length of a substring of the lexbuf that is returned as a UTF-8 string. Position and length are counted in characters, not in bytes.
val utf8_sub_string_length : int -> int -> unicode_lexbuf -> int
Returns String.length (utf8_sub_string args). Tries not to allocate the UTF-8 string.
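The two functions above can be mimicked on a plain array of code points. In this sketch, position and length are given in characters, and the byte length is computed without building the string. The names utf8_width, sub_length, and sub_string are hypothetical and say nothing about how Netulex implements these functions. Buffer.add_utf_8_uchar requires OCaml 4.06 or later.

```ocaml
(* Number of bytes that one code point occupies in UTF-8. *)
let utf8_width (cp : int) : int =
  if cp < 0x80 then 1
  else if cp < 0x800 then 2
  else if cp < 0x10000 then 3
  else 4

(* Byte length of the UTF-8 encoding of chars.(pos) .. chars.(pos+len-1),
   computed by summing widths; no string is allocated. *)
let sub_length (chars : int array) (pos : int) (len : int) : int =
  let total = ref 0 in
  for i = pos to pos + len - 1 do
    total := !total + utf8_width chars.(i)
  done;
  !total

(* The same range, actually encoded as a UTF-8 string. *)
let sub_string (chars : int array) (pos : int) (len : int) : string =
  let b = Buffer.create len in
  for i = pos to pos + len - 1 do
    Buffer.add_utf_8_uchar b (Uchar.of_int chars.(i))
  done;
  Buffer.contents b
```

With the code points 0x61, 0xE9, 0x20AC, selecting position 1 and length 2 covers two characters but five bytes, which is the distinction between character and byte positions that the description draws.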