module Neturl: sig .. end
Uniform Resource Locators (URLs)
Contents
The tutorial has been moved to Neturl_tut.
This module provides functions to parse URLs, to print URLs, to store URLs, to modify URLs, and to apply relative URLs.
URLs are strings formed according to pattern (1) or (2):
(1) scheme://user;userparams:password@host:port/path;params?query#fragment
(2) scheme:other;params?query#fragment
This module defines the type url_syntax, whose values describe which parts are
allowed, required, or not allowed for a concrete URL scheme (see below).
Not all characters are allowed in a URL. Some characters are allowed,
but have the special task to separate the various parts of the URL
(reserved characters).
However, it is possible to include even invalid or reserved characters
as normal content by applying the %-encoding to these characters:
a '%' indicates that an encoded character follows, and the character
is denoted by a two-digit hexadecimal number (e.g. %2f for '/').
In the following descriptions, the term "encoded string" means a string
containing such %-encoded characters, and the "decoded string" means a
string not containing such characters.
See the module Netencoding.Url for functions encoding or decoding strings.
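To make the %-encoding concrete, here is a minimal sketch that uses only the OCaml standard library. The helpers is_unreserved, hex_value, percent_encode and percent_decode are hypothetical names introduced for illustration; they are not part of Neturl or Netencoding.Url, and the set of characters left unencoded is an assumption of this sketch.

```ocaml
(* Illustrative sketch only: NOT the implementation used by Netencoding.Url. *)

(* Characters this sketch leaves unencoded (an assumption of the example). *)
let is_unreserved (c : char) : bool =
  match c with
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '-' | '_' | '.' | '~' -> true
  | _ -> false

(* Replace every other byte by '%' followed by two hexadecimal digits. *)
let percent_encode (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
       if is_unreserved c then Buffer.add_char buf c
       else Buffer.add_string buf (Printf.sprintf "%%%02X" (Char.code c)))
    s;
  Buffer.contents buf

(* Value of one hexadecimal digit; both cases are accepted. *)
let hex_value (c : char) : int =
  match c with
  | '0' .. '9' -> Char.code c - Char.code '0'
  | 'a' .. 'f' -> Char.code c - Char.code 'a' + 10
  | 'A' .. 'F' -> Char.code c - Char.code 'A' + 10
  | _ -> invalid_arg "percent_decode: bad hex digit"

(* Turn each %XX sequence back into the byte it denotes. *)
let percent_decode (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec loop i =
    if i < n then begin
      if s.[i] = '%' then begin
        if i + 2 >= n then invalid_arg "percent_decode: truncated escape";
        Buffer.add_char buf
          (Char.chr (16 * hex_value s.[i + 1] + hex_value s.[i + 2]));
        loop (i + 3)
      end else begin
        Buffer.add_char buf s.[i];
        loop (i + 1)
      end
    end
  in
  loop 0;
  Buffer.contents buf

let () =
  print_endline (percent_encode "a/b c");
  print_endline (percent_decode "%2f")
```

In this sketch "a/b c" is the decoded string and "a%2Fb%20c" the corresponding encoded string.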
The type url describes values storing the components of a URL,
and the url_syntax for the URL. In general, the components are
stored as encoded strings; however, the %-encoding is not applicable
to all components.
For convenience, the functions creating, modifying, and accessing URLs can handle both encoded and decoded strings. To avoid errors, these functions take and return decoded strings by default.
Note that there is currently no function to compare URLs. The
canonical comparison ( = ) is not applicable because the same URL
may be written in different ways.
Note that nothing is said about the character set/encoding of URLs.
Some protocols and standards prefer UTF-8 as fundamental encoding
and apply the %-encoding on top of it; i.e. the byte sequence
representing a character in UTF-8 is %-encoded.
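The layering of %-encoding on top of UTF-8 can be illustrated with a small standard-library sketch. The function percent_encode_all is a hypothetical helper written for this example (it is not part of Neturl) and, for simplicity, it encodes every byte of its input.

```ocaml
(* Illustrative sketch: %-encode every byte of a string, one %XX per byte. *)
let percent_encode_all (s : string) : string =
  let buf = Buffer.create (3 * String.length s) in
  String.iter
    (fun c -> Buffer.add_string buf (Printf.sprintf "%%%02X" (Char.code c)))
    s;
  Buffer.contents buf

(* U+00E9 (e with acute accent) is the two-byte sequence C3 A9 in UTF-8. *)
let e_acute_utf8 : string = "\xC3\xA9"

let () = print_endline (percent_encode_all e_acute_utf8)
```

A single non-ASCII character therefore appears in the URL as one %XX group per UTF-8 byte.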
Standards Compliance
This module implements RFC 1738 and RFC 1808. There is also a newer RFC, 2396, updating the former RFCs, but this module is not fully compatible with RFC 2396. The following (minor) problems may occur:
imap URLs.
The module accepts a question mark directly after the host part, as in
"http://host?query". This is illegal in RFC 1738. The consequence
is, however, that question marks in user strings must be escaped.
exception Malformed_URL
val extract_url_scheme : string -> string
Returns the URL scheme from the string representation of a URL. E.g.
extract_url_scheme "http://host/path" = "http".
The scheme name is always converted to lowercase characters.
Raises Malformed_URL if the scheme name is not found.
type url_syntax_option =
  | Url_part_not_recognized  (* The part, even if there, is not even recognized *)
  | Url_part_allowed         (* The part can be present *)
  | Url_part_required        (* The part must be present *)
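type url_syntax = {
  url_enable_scheme : url_syntax_option;
  url_enable_user : url_syntax_option;
  url_enable_user_param : url_syntax_option;
  url_enable_password : url_syntax_option;
  url_enable_host : url_syntax_option;
  url_enable_port : url_syntax_option;
  url_enable_path : url_syntax_option;
  url_enable_param : url_syntax_option;
  url_enable_query : url_syntax_option;
  url_enable_fragment : url_syntax_option;
  url_enable_other : url_syntax_option;
  url_accepts_8bits : bool;
  url_is_valid : url -> bool;
  url_enable_relative : bool;
}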
Values of type url_syntax describe which components of a URL are
recognized, which are allowed (and optional), and which are required.
Not all combinations are valid; the predicate expressed by the
function url_syntax_is_valid must hold.
The function url_is_valid is applied when a fresh URL is created
and must return true. This function makes it possible to add an arbitrary
validity criterion to url_syntax. (Note that the URL passed to
this function is not fully working; you can safely assume that the
accessor functions url_scheme etc. can be applied to it.)
Switch url_accepts_8bits: If true, the bytes with code 128 to
255 are treated like alphanumeric characters; if false, these bytes
are illegal (but it is still possible to include such bytes in their
encoded form: %80 to %FF).
Switch url_enable_relative: If true, the syntax allows relative
URLs in principle. Actually, parsing of relative URLs is possible
when the optional parts are flagged as Url_part_allowed and not
as Url_part_required. However, it is useful to specify URL syntaxes
always as absolute URLs, and to weaken them on demand when a relative
URL is found by the parser. This switch enables that. In particular,
the function partial_url_syntax checks this flag.
type url
Values of type url describe concrete URLs. Every URL must have
a fundamental url_syntax, and it is only possible to create URLs
conforming to the syntax. See make_url for further information.
val url_syntax_is_valid : url_syntax -> bool
Checks whether the passed url_syntax is valid.
val partial_url_syntax : url_syntax -> url_syntax
val null_url_syntax : url_syntax
let my_syntax = { null_url_syntax with
url_enable_host = Url_part_required; ... }
val ip_url_syntax : url_syntax
val common_url_syntax : (string, url_syntax) Hashtbl.t
This hashtable maps scheme names to syntax descriptions for common URL schemes:
  "file": scheme, host?, path
  "ftp": scheme, user?, password?, host, port?, path?, param?
    Note: param is not checked.
  "http", "https": scheme, user?, password?, host, port?, path?, query?
  "mailto": scheme, other, query? (RFC 2368)
  "pop", "pops": scheme, user?, user_param?, password?, host, port?
    Note: user_param is not checked. (RFC 2384)
  "imap", "imaps": scheme, user?, user_param?, password?, host, port?,
    path?, query? (RFC 2192)
    Note: "param" is intentionally not recognized to get the resolution of
    relative URLs as described in the RFC. When analysing this kind of URL,
    it is recommended to re-parse it with "param" enabled.
  "news": scheme, other (RFC 1738)
  "nntp", "nntps": scheme, host, port?, path (with two components) (RFC 1738)
  "data": scheme, other (RFC 2397). "other" is not further decomposed.
  "ipp", "ipps": scheme, host, port?, path?, query? (RFC 3510)
  "cid", "mid": Content/message identifiers: scheme, other
Notes:
  These syntax descriptions can be weakened for relative URLs; see partial_url_syntax.
  The descriptions above do not list fragments. These can be enabled by setting
  url_enable_fragment to Url_part_allowed. E.g.
  { file_url_syntax with url_enable_fragment = Url_part_allowed }
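As a sketch of how a predefined syntax might be looked up and extended with fragments, the following uses only names declared in this interface. It assumes that the Ocamlnet "netstring" library (which provides Neturl) is linked; the host www.example.org and the names http_syntax and http_with_fragment are illustrative.

```ocaml
(* Requires Ocamlnet's "netstring" library, which provides Neturl. *)

(* Look up the predefined syntax description for the "http" scheme. *)
let http_syntax : Neturl.url_syntax =
  Hashtbl.find Neturl.common_url_syntax "http"

(* Derive a variant of that syntax in which a fragment may be present. *)
let http_with_fragment : Neturl.url_syntax =
  { http_syntax with Neturl.url_enable_fragment = Neturl.Url_part_allowed }

(* Parse a URL carrying a fragment according to the derived syntax. *)
let u : Neturl.url =
  Neturl.url_of_string http_with_fragment "http://www.example.org/doc#sec1"

let () = Printf.printf "fragment = %s\n" (Neturl.url_fragment u)
```

Because the record update copies http_syntax, the entry stored in common_url_syntax itself is left unchanged.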
val null_url : url
A URL without any component, with the syntax null_url_syntax.
val make_url : ?encoded:bool ->
?scheme:string ->
?user:string ->
?user_param:string list ->
?password:string ->
?host:string ->
?port:int ->
?path:string list ->
?param:string list ->
?query:string ->
?fragment:string -> ?other:string -> url_syntax -> url
Creates a URL from components:
  The components scheme and host are simple strings to which the
  %-encoding is not applicable.
  The component port is a simple number. Of course, the %-encoding
  is not applicable, too.
  The components user, password, query, fragment, and other
  are strings which may contain %-encoded characters. By default,
  you can pass any string for these components, and problematic characters
  are automatically encoded. If you set encoded:true, the passed
  strings must already be encoded, but the function checks whether
  the encoding is syntactically correct.
  Note that for query even the characters '?' and '=' are encoded
  by default, so you need to set encoded:true to pass a reasonable
  query string.
  The components user_param, path and param are lists of strings which may
  contain %-encoded characters. Again, the default is to pass
  decoded strings to the function, and the function encodes them
  automatically; by setting encoded:true the caller becomes responsible
  for encoding the strings. Passing empty lists for these components
  means that they are not part of the constructed URL.
  See below for the representation of these components.
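The created URL must conform to the url_syntax, i.e.:
  The URL must only contain components which are recognized by the syntax.
  The URL must contain all components which are required by the syntax.
  The URL must fulfil the predicate expressed by the url_is_valid
  function of the syntax.
The path of a URL is represented as a list of '/'-separated path
components, i.e.
[ s1; s2; ...; sN ]
represents the path
s1 ^ "/" ^ s2 ^ "/" ^ ... ^ "/" ^ sN
As special cases:
  [] is the non-existing path
  [ "" ] is "/"
  [ "";"" ] is illegal
Except for s1 and sN, the path components must not be empty strings.
To avoid ambiguities, it is illegal to create URLs with both relative
paths (s1 <> "") and host components.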
Parameters of URLs (param and user_param) are components
beginning with ';'. The list of parameters is represented as a list
of strings where each string contains the value following ';'.
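The rules above can be tried out with a short sketch. It assumes that Ocamlnet's "netstring" library is linked; the host name, path and query values are made up for illustration, and the names u1 and u2 are hypothetical.

```ocaml
(* Requires Ocamlnet's "netstring" library, which provides Neturl. *)
let http_syntax : Neturl.url_syntax =
  Hashtbl.find Neturl.common_url_syntax "http"

(* Decoded input (the default): problematic characters such as the
   space in the last path component are encoded by make_url. *)
let u1 : Neturl.url =
  Neturl.make_url
    ~scheme:"http" ~host:"www.example.org"
    ~path:[ ""; "docs"; "my file.txt" ]  (* leading "" marks an absolute path *)
    http_syntax

(* Already-encoded input: needed so that '=' and '&' in the query string
   keep their meaning instead of being %-encoded. *)
let u2 : Neturl.url =
  Neturl.make_url ~encoded:true
    ~scheme:"http" ~host:"www.example.org"
    ~path:[ ""; "search" ]
    ~query:"q=ocaml&lang=en"
    http_syntax

let () =
  print_endline (Neturl.string_of_url u1);
  print_endline (Neturl.string_of_url u2)
```

Note that ~encoded:true applies to all components of the call, so every string passed to u2 has to be in encoded form already.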
val modify_url : ?syntax:url_syntax ->
?encoded:bool ->
?scheme:string ->
?user:string ->
?user_param:string list ->
?password:string ->
?host:string ->
?port:int ->
?path:string list ->
?param:string list ->
?query:string ->
?fragment:string -> ?other:string -> url -> url
val remove_from_url : ?scheme:bool ->
?user:bool ->
?user_param:bool ->
?password:bool ->
?host:bool ->
?port:bool ->
?path:bool ->
?param:bool ->
?query:bool -> ?fragment:bool -> ?other:bool -> url -> url
Removes the components passed with the value true from the URL, and
returns the modified URL.
The modified URL shares unmodified components with the original URL.
val default_url : ?encoded:bool ->
?scheme:string ->
?user:string ->
?user_param:string list ->
?password:string ->
?host:string ->
?port:int ->
?path:string list ->
?param:string list ->
?query:string ->
?fragment:string -> ?other:string -> url -> url
val undefault_url : ?scheme:string ->
?user:string ->
?user_param:string list ->
?password:string ->
?host:string ->
?port:int ->
?path:string list ->
?param:string list ->
?query:string ->
?fragment:string -> ?other:string -> url -> url
val url_syntax_of_url : url -> url_syntax
Returns the url_syntax record of a URL.
val url_of_string : url_syntax -> string -> url
Parses the passed string according to the passed url_syntax.
val string_of_url : url -> string
val parse_url : ?schemes:(string, url_syntax) Hashtbl.t ->
?base_syntax:url_syntax ->
?accept_8bits:bool -> ?enable_fragment:bool -> string -> url
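Parses the string and returns the URL it represents.
If the URL is absolute, its syntax is looked up in schemes.
If the URL is relative, the base_syntax will be taken
if passed. Without base_syntax, relative URLs cannot be
parsed.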
  schemes: This hashtable maps scheme names to syntax descriptions.
  The default is common_url_syntax.
  base_syntax: If passed, the function can parse relative URLs
  according to this syntax. If not passed, the function will raise
  Malformed_URL on a relative URL.
  accept_8bits: If false, the default, it depends on the
  syntax descriptions in schemes whether 8 bit characters are
  accepted in the input or not. If true, 8 bit characters are
  always accepted.
  enable_fragment: If false, the default, it depends on the
  syntax descriptions in schemes whether fragment identifiers
  (e.g. "#fragment") are recognized or not. If true, fragments
  are always recognized.
val fixup_url_string : ?escape_hash:bool -> string -> string
If escape_hash is set, '#' is also escaped.
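A minimal sketch of parse_url with the default schemes table follows, assuming Ocamlnet's "netstring" library is linked. The URL is invented for illustration, port_or is a hypothetical helper, and the accessor functions used here are the ones declared further down in this interface.

```ocaml
(* Requires Ocamlnet's "netstring" library, which provides Neturl. *)

(* An absolute URL: its syntax is found via the "http" scheme name. *)
let u : Neturl.url = Neturl.parse_url "http://www.example.org:8080/a/b?x=1"

(* Accessors raise Not_found for a missing component, so a default
   value can be supplied by catching that exception. *)
let port_or (default : int) (url : Neturl.url) : int =
  try Neturl.url_port url with Not_found -> default

(* Without ~base_syntax a relative URL is rejected with Malformed_URL. *)
let relative_is_rejected : bool =
  try ignore (Neturl.parse_url "a/b"); false
  with Neturl.Malformed_URL -> true

let () =
  Printf.printf "scheme=%s host=%s port=%d path=%s\n"
    (Neturl.url_scheme u)
    (Neturl.url_host u)
    (port_or 80 u)
    (Neturl.join_path (Neturl.url_path u))
```

Passing ~base_syntax makes the same call accept "a/b" as a relative URL.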
val url_provides : ?scheme:bool ->
?user:bool ->
?user_param:bool ->
?password:bool ->
?host:bool ->
?port:bool ->
?path:bool ->
?param:bool ->
?query:bool -> ?fragment:bool -> ?other:bool -> url -> bool
Returns true iff the URL has all of the components passed with
true value.
val url_scheme : url -> string
val url_user : ?encoded:bool -> url -> string
val url_user_param : ?encoded:bool -> url -> string list
val url_password : ?encoded:bool -> url -> string
val url_host : url -> string
val url_port : url -> int
val url_path : ?encoded:bool -> url -> string list
val url_param : ?encoded:bool -> url -> string list
val url_query : ?encoded:bool -> url -> string
val url_fragment : ?encoded:bool -> url -> string
val url_other : ?encoded:bool -> url -> string
These functions return components of the URL. The components are returned
in decoded form unless encoded:true is set.
If the component does not exist, the exception Not_found is raised.
val split_path : string -> string list
Splits a '/'-separated path into components (e.g. to set up the
path argument of make_url).
E.g.
split_path "a/b/c" = [ "a"; "b"; "c" ],
split_path "/a/b" = [ ""; "a"; "b" ],
split_path "a/b/" = [ "a"; "b"; "" ]
Beware that split_path ".." returns [".."] while split_path "../"
returns [".."; ""]. The two will behave differently, for example
when used with Neturl.apply_relative_url.
val join_path : string list -> string
val norm_path : string list -> string list
Removes "." and ".." from the path if possible. Deletes double slashes.
Examples
  norm_path ["."] = []                                       means: "." = ""
  norm_path ["."; ""] = []                                   means: "./" = ""
  norm_path ["a"; "."] = ["a"; ""]                           means: "a/." = "a/"
  norm_path ["a"; "b"; "."] = ["a"; "b"; ""]                 means: "a/b/." = "a/b/"
  norm_path ["a"; "."; "b"; "."] = ["a"; "b"; ""]            means: "a/./b/." = "a/b/"
  norm_path [".."] = [".."; ""]                              means: ".." = "../"
  norm_path [".."; ""] = [".."; ""]                          means: "../" = "../"
  norm_path ["a"; "b"; ".."; "c" ] = ["a"; "c"]              means: "a/b/../c" = "a/c"
  norm_path ["a"; "b"; ".."; "c"; ""] = ["a"; "c"; ""]       means: "a/b/../c/" = "a/c/"
  norm_path ["";"";"a";"";"b"] = [""; "a"; "b"]              means: "//a//b" = "/a/b"
  norm_path ["a"; "b"; ""; ".."; "c"; ""] = ["a"; "c"; ""]   means: "a/b//../c/" = "a/c/"
  norm_path ["a"; ".."] = []                                 means: "a/.." = ""
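The three path functions can be combined into a string-to-string normalizer. This is only a sketch, assuming Ocamlnet's "netstring" library is linked; normalize is a hypothetical helper name, and the inputs are taken from the examples listed above.

```ocaml
(* Requires Ocamlnet's "netstring" library, which provides Neturl. *)

(* Split the path string, normalize the component list, and join the
   components again into a '/'-separated string. *)
let normalize (p : string) : string =
  Neturl.join_path (Neturl.norm_path (Neturl.split_path p))

let () =
  List.iter
    (fun p -> Printf.printf "%S -> %S\n" p (normalize p))
    [ "a/b/../c"; "a/./b/." ]
```

Working on the component list rather than on the string makes the trailing empty component explicit, which is what distinguishes "a/b" from "a/b/".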
val apply_relative_url : url -> url -> url
apply_relative_url base rel:
Interprets rel relative to base and returns the new URL. This
function implements RFC 1808.
It is not necessary that rel has the same syntax as base.
Note, however, that it is checked whether the resulting URL is
syntactically correct with the syntax of base. If not, the
exception Malformed_URL will be raised.
Examples (the URLs are represented as strings, see Neturl.split_path
to split them for Neturl.make_url):
base="x/y", url="a/b" => result="x/a/b"
base="x/y/", url="a/b" => result="x/y/a/b"
base="x/y/..", url="a/b" => result="x/y/a/b" (beware!)
base="x/y/../", url="a/b" => result="x/a/b"
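The first example above can be reproduced with parsed URLs roughly as follows. This is a sketch under assumptions: Ocamlnet's "netstring" library is linked, the host www.example.org is invented, and the relative URL is parsed with the syntax of the base URL passed as base_syntax.

```ocaml
(* Requires Ocamlnet's "netstring" library, which provides Neturl. *)

(* The base URL is absolute; its last path component "y" is a "file". *)
let base : Neturl.url = Neturl.parse_url "http://www.example.org/x/y"

(* A relative URL has no scheme, so a base syntax must be supplied. *)
let rel : Neturl.url =
  Neturl.parse_url ~base_syntax:(Neturl.url_syntax_of_url base) "a/b"

(* Resolve rel against base as described by RFC 1808. *)
let result : Neturl.url = Neturl.apply_relative_url base rel

let () = print_endline (Neturl.string_of_url result)
```

Had the base path ended in a slash ("/x/y/"), the relative path would be appended below "y" instead of replacing it, as the second example above shows.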
val ensure_absolute_url : ?base:url -> url -> url
If the URL is relative, it is made absolute by resolving it relative
to base. If there is no base, or if the base URL does not allow the
parts that would be added (e.g. if the anonymous URL possesses a
fragment and base does not allow that), this will fail, and the
function raises Malformed_URL.
val file_url_of_local_path : ?getcwd:(unit -> string) -> string -> url
Note that no character set conversions are performed.
Win32: The input path name may use forward or backward slashes.
Absolute paths with drive letters and UNC paths are recognised.
Relative paths with drive letters, however, are not recognised
(e.g. "c:file"), as it is not possible to access the drive-specific
working directory from the O'Caml runtime.
Cygwin: The input path name may use forward or backward slashes.
Absolute paths with drive letters and UNC paths are recognised.
The former are translated to "/cygdrive" names.
getcwd: The function returns the path taken as current working
directory. Note that for Win32 this must be either an absolute name
with drive letter, or a UNC path. Default: Sys.getcwd
val local_path_of_file_url : url -> string
If the URL is not a file URL, or is not absolute, the function will fail.
Win32: The URL must either contain a drive letter, or must refer to another host.
Cygwin: Drive letters and remote URLs are recognised.
val print_url : url -> unit