A simplified URI strings parser written in SWI-Prolog.
The program allows you to parse URI strings that follow a simplified grammar compared to the RFC3986 standard. It checks that the string matches the grammar, and if so it returns its main components, otherwise it returns false.
The respected simplified grammar is as follows:
URI ::= URI1 | URI2
URI1 ::= scheme ‘:’ [authorithy] [‘/’ [path] [‘?’ query] [‘#’ fragment]]
URI2 ::= scheme ‘:’ scheme-syntax
scheme ::= <identifier>
authorithy ::= ‘//’ [userinfo ‘@’] host [‘:’ port]
userinfo ::= <identifier>
host ::= <host-identifier> [‘.’ <host-identifier>]* | IP-address
port ::= <digit>+
IP-address ::= <NNN.NNN.NNN.NNN - with N one digit and NNN < 255>
path ::= <identifier> [‘/’ <identifier>]*
query ::= <characters excluding ‘#’>+
fragment ::= <characters>+
<identifier> ::= <characters excluding ‘/’, ‘?’, ‘#’, ‘@’, and ‘:’>+
<host-identifier> ::= <characters excluding ‘.’, ‘/’, ‘?’, ‘#’, ‘@’, and ‘:’>+
<digit> ::= ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’
scheme-syntax ::= <special syntax>
URI2 represents the grammar to be respected for some special schemes. String parsing is case-insensitive so patterns like _ "MaIlTo" _ or _ "MAILTO" _ are recognized as special patterns "mailto" and so on.
Special schemes:
mailto:
scheme-syntax ::= [userinfo ['@' host]]
news:
scheme-syntax ::= [host]
tel and fax:
scheme-syntax ::= [userinfo]
zos:
scheme-syntax ::= [authorithy] [‘/’ zospath [‘?’ query] [‘#’ fragment]]
zospath ::= <id44> [‘(’ <id8> ‘)’]
id44 ::= (<alphanumeric characters> | ‘.’) +
id8 ::= (<alphanumeric characters>) +
---
where both id44 and id8 must start with an alphabetic character and id44 cannot end with a period.
Furthermore, the length of id44 must not exceed 44 characters and that of id8 must not exceed 8.
The main components of a URI that are returned by the program are, in order:
Scheme, Userinfo, Host, Port, Path, Query, Fragment
In the structure returned by the program, the missing fields are returned as an empty list []
, instead if present they are returned as atoms, except for the port, which is returned as an integer (and as 80 by default if it's not present).
Note: port is returned as 80 by default even in cases where authority is not present at all and in special schemes that do not consider it.
uri_parse
is the main predicate with which you can parse a string.
In fact, it has 2 arguments, URIString
and URI
; the first is the string you want to parse, while the second is the resulting structure containing the components of the URI.
The predicate is not invertible but can answer queries with partially instantiated terms as per delivery.
Operation:
- The predicate first checks that the String passed to it is actually a string.
- It then converts the string to a character list using the
string_chars/2
predicate. - At this point, using
phrase/2
, it passes the list of characters touridcg
which takes care of the actual parsing and therefore of checking compliance with the grammar. - Finally, if the string passes the check, the resulting structure is returned, in the form
uri(Scheme, Userinfo, Host, Port, Path, Query, Fragment).
The uri_display
are 2 predicates that allow the printing of a URI structure on a stream.
uri_display/2
has 2 arguments, the URI structure and the stream it should print to, and uses the format/3
predicate to print the URI components on the stream it is passed in a way that they are clearly distinguishable .
uri_display/1
instead simply calls the predicate with 2 arguments, passing it current_output
as a stream which is nothing more than the terminal of SWI-Prolog, thus allowing a more immediate printing on the terminal.
The predicates do not check the validity of the components of the structure that is passed to them, but they check that the passed structure is of the indicated form and therefore has the 7 elements necessary for printing.
However, it is possible to use them in combination with the uri_parse
predicate in this way
uri_parse(URIString, URI), uri_display(URI) .
Passing it a URI in the form of a string. This way uri_parse
will take care of checking the validity of the string, and if so, uri_display
would print the resulting structure. If the string is invalid, false will be returned instead.
Note: using uri_display_1
will also print side effect of uri_parse
on the terminal.
As mentioned earlier, the main predicate uses the phrase/2
predicate to pass the list of characters obtained from the string to uridcg
which is a DCG.
A DCG, or Definite Clause Grammar, is nothing more than "syntactic sugar" for the normal Definite Clauses in Prolog, and in fact they can be translated into normal predicates if desired.
For example, the rules:
a --> a, b.
a --> [a].
b --> [b].
They can be translated into
a(A, C) :- a(A, B), b(B, C).
a([a|X], X).
b([b|X], X).
However, they greatly simplify the writing and readability of a program, and are particularly useful for describing a grammar and therefore also for parsing strings in general.
The rules of a DCG are made up, as in the example, of a head (head) and a body (body), separated by "-->".
In the body there may be other rules, terminal characters enclosed in square brackets ([]
), and in braces ({}
) you can add some normal Prolog code.
In the program, the DCGs in addition to checking if the list of characters respects the grammar they define, also return what respects it (except the delimiters), doing in this way much of the work and reconstructing the necessary elements by backtracking.
The use of numerous green cuts is also present to avoid that the query tries to unify with other predicates and returns only one result.
This rule defines the grammar of URI1 and URI2, checks their correctness on a list of characters and returns their structure
uri(Scheme, Userinfo, Host, Port, Path, Query, Fragment)
As far as URI1 is concerned, 2 predicates are defined, one in which there is the block [‘/’ [path] [‘?’ Query] [‘#’ fragment]]
and one without.
The optionality of [authority]
is instead managed in authority itself.
It also makes use of the additional predicate not_special/1
to check that the parsed scheme
does not belong to special schemes.
As for URI2, on the other hand, there are multiple predicates for each special scheme, which define their scheme-syntax and which exploit the downcase_atom/2
predicate to temporarily convert the obtained scheme into lowercase and check if it is equal or not to the corresponding special scheme.
The use of the predicate is necessary to implement the case-insensitive of parsing.
They are intermediate DCGs necessary for the conversion of path, query and fragment to atom through the use of the atom_chars/2
predicate, but also for checking the presence of the latter.
The predicates uriquery and urifragment also respectively check for the presence of the '?' and '#' characters before query and fragment.
Finally, if present, they call the path, query and fragment DCGs to check their correctness and the absence of invalid characters.
Instead, if they're not present, each predicate has a version which if it encounters the empty list, it returns the empty list.
The same reasoning applies to mail_host and news_host for their respective special schemes and for auth_port for authority. The only difference is the use of number_chars/2
in auth_port to return the port as an integer instead of an atom.
It simply calls id and converts the result of id to atom using atom_chars again.
3 predicates are implemented:
- the first 2 represent the authority grammar, in one there is userinfo while in the other no (the optionality of [':' port] is managed in auth_port)
- the last one unifies with the case in which authority is not present and is therefore "empty".
It behaves exactly like scheme.
It has 3 predicates:
- the first checks if the host by chance unifies with ip, in this case the green cut prevents it from unifying with the following ones
- the other 2 instead recursively define the other host definition based on id_host, which is why you need to use a
flatten/2
to get a single list of characters, since host itself returns a list of characters.
Recursively defines port as a sequence of digit.
Check that the list of characters respects the grammar of 4 triplets of digit separated by periods as per definition.
It also uses the check_ip/1
predicate which in turn relies on number_string/2
to check if any triplet converted to integer is <= 255.
Note: number_string
also works on character lists, as per Official SWI-Prolog Documentation section 5.2.2.
In any case, a multi-step conversion could have been used using first string_chars
on the character list and then number_string
on the resulting string.
It recursively defines the path and as for host makes use of flatten/2
to avoid nested lists.
It recursively defines queries and checks that each character is different from '#' and therefore valid.
Simply scan and go through the list of characters as is, each character being valid.
It recursively defines id and also uses is_id/1
to check that parsed characters are valid and respect grammar.
It behaves the same as id and uses is_id_host/1
to check its validity.
Check that the character passed is a digit from 0 to 9 via the is_digit/1
predicate.
It has 2 predicates, one with the presence of id8 in round brackets and the other with only id44.
Also in zospath the correct length of id44 and id8 is checked, and flatten/2
and atom_chars/2
are used to get a single character list and convert it to atom.
The predicate itself only checks that the first character is an alphabetic character using the predicate char_type/2
with the second argument alpha
, after which if other characters are present it calls rest_id44, which checks that the rest is made up of alphanumeric characters (via id_alnum) or dots '.'.
The recursive definition of rest_id44 does not allow the string to end with a period, it also uses flatten/2
always to avoid nested lists.
Check that the first character is alphabetic again using char_type/2
with alpha
, finally check that the remainder if present is alphanumeric using id_alnum.
Use flatten/2
to avoid nested lists.
Check that the list is made up of only alphanumeric characters using the char_type/2
predicate with the second argument alnum
. Given the recursive definition it uses flatten/2
to get a character-only list, with no nested lists.
The testing is carried out through automatic tests written following the documentation on Prolog Unit Tests.
The test file uri-parse.plt
contains 1309 test cases which the program passes successfully.
It is possible to customize and write your own test file in .plt
extension starting from the file already present.
The line of code has been added at the beginning of the main code of the uri-parse.pl
project
:- module(uri_parse, [uri_parse/2]).
In order to start the tests.
Successful tests are of the type
test(true_scheme) :- uri_parse("s:", uri ('s', [], [], 80, [], [], [])).
And those with negative results
test(false_scheme) :- \+ uri_parse("s", _).
As you can see, the \+
operator is used which returns true.
when it is not possible to unify and resolve a predicate test (and the terminal therefore returns false.
).
The general structure is therefore
test(test_id) :- predicate(input, result). | \+ predicate(input, _).
Where:
id_test
is the name given to the test, useful for understanding which tests failpredicate
is the name of the predicate being testedinput
is the input that the predicate takesresult
is the correct result obtained with the passed input\+
is used when a result cannot be obtained with the passed input (_
)
To run custom tests, make sure that the .pl
file and the .plt
test file are in the same directory.
Then mark the path of the directory where you put the program and the test files, which on windows will be of the type
C:/Users/.../Directory
At this point, open SWI-Prolog and run the command:
working_directory(OldDir, 'NewDir').
Which will allow you to change to the new NewDir
directory.
NewDir
must obviously be replaced with the path of the folder where you put the files and that you have previously saved.
Then run the following commands:
['uri-parse'].
load_test_files([]).
run_tests.
Alternatively, you can also run a single single command:
['uri-parse'], load_test_files([]), run_tests.
Note: in this case the command is ['uri-parse']
, but more generally it would be ['test_file_name']
with the name of your test file .plt
as test_file_name
.