
{1:tutorial Netulex Tutorial}

First, you need Alain Frisch's [ulex] utility. It installs itself as a
findlib library under the name [ulex].

Next, write your lexer, e.g. (line numbers in brackets):

{[
[1] open Netulex
[2] let digits = lexer
[3]   | ['0'-'9']+ -> `Number(int_of_string(Ulexing.utf8_lexeme lexbuf))
[4]   | 8364       -> `Euro_sign   (* Code point #8364 in Unicode *)
]}

This is a deliberately trivial example. The lexer accepts sequences of digits
and returns them as [`Number] tokens. It also recognizes the euro sign and
returns it as [`Euro_sign]. Note that in the first case
{!Netulex.Ulexing.utf8_lexeme}
is called to retrieve the current lexeme as a UTF-8 string. (Digits are
admittedly a poor example, as they are plain ASCII and do not really need
UTF-8. Imagine, however, extending the scanner to other number systems
represented in the Unicode character set.)

Line 1 is essential. If you don't open [Netulex], the generated
[ulex] code will use the [Ulexing] module that comes with [ulex]
instead of the one provided by [Netulex].

Call the lexer as follows (line numbers in brackets):
{[
[5] let sample = "42543\226\130\172";;
[6] let ulb = Netulex.ULB.from_string `Enc_utf8 sample;;
[7] let lexbuf = Netulex.Ulexing.from_ulb_lexbuf ulb;;
[8] let first_token = digits lexbuf;;
]}

Now, [first_token] is [`Number 42543]. After

{[
[9] let second_token = digits lexbuf;;
]}

[second_token] is set to [`Euro_sign], because the three-byte sequence
["\226\130\172"] represents the euro sign in UTF-8.
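
The byte values can be checked by hand. The following is a minimal sketch
using only the OCaml standard library; [utf8_encode_3] is a hypothetical
helper written for this illustration and is not part of [Netulex] or [ulex].
It handles only code points that need exactly three bytes in UTF-8.

```ocaml
(* Hypothetical helper: encode a code point in the range U+0800..U+FFFF
   (excluding surrogates) as its three-byte UTF-8 representation. *)
let utf8_encode_3 cp =
  if cp < 0x800 || cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_encode_3: code point does not need exactly three bytes";
  let b = Bytes.create 3 in
  (* Lead byte: 1110xxxx carries the top 4 bits. *)
  Bytes.set b 0 (Char.chr (0xE0 lor (cp lsr 12)));
  (* Continuation bytes: 10xxxxxx carry 6 bits each. *)
  Bytes.set b 1 (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
  Bytes.set b 2 (Char.chr (0x80 lor (cp land 0x3F)));
  Bytes.to_string b

(* Code point 8364 is the euro sign matched in line 4 of the lexer. *)
let euro = utf8_encode_3 8364

let () = assert (euro = "\226\130\172")
```

The lexer rule in line 4 matches the code point 8364, not these bytes;
the conversion from the byte sequence to the code point happens while the
input is decoded according to the encoding given in line 6.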

In line 6, the [`Enc_utf8] argument declares that [sample] is a
UTF-8 string. You can pass any encoding that the {!Netconversion}
module understands.

To scan from another source, change line 6,
e.g.

{[
[6'] let ulb = Netulex.ULB.from_in_obj_channel `Enc_utf8 ch
]}

where [ch] is any input channel the {!Netchannels} module supports.
For example, to read from a file:

{[
let ch = new Netchannels.input_channel (open_in "filename")
]}

Compile the examples with

{[
ocamlfind ... -package ulex,netstring -syntax camlp4o ...
]}

For the syntax of the lexer rules, see the documentation that comes
with [ulex].


This web site is published by Informatikbüro Gerd Stolpmann