In this article, I present a brief introduction to parsing combinators that are designed to facilitate parser implementation in CPS-style. This is also an occassion for me to advocate the use of abstract types in programming via a concrete case.
Let us first take a look at the following two type definitions:
// typedef parcont (a:t@ype) = (a, parinp) -<cloref1> parout // typedef parinp_nullify (a:t@ype) = (parinp, parcont(a)) -<cloref1> parout //The types parinp and parout are abstract at this stage. Intuitively, parinp is meant to be the type for parsing input and parout the type for parsing output. Given a type T, the type parcont(T) is for a continuation taking a value of the type T and another value representing parsing input; the type parinp_nullify(T) is for a parser of CPS-style that parses some input into a value of the type T and then passes the value and the remaining input to a continuation of the type parcont(T). If a parser of the type parinp_nullify(T) cannot turn the given parsing input into a value of the type T, then a raised exception ParFailExn() is issued as the parsing result.
The type constructor kparser is declared to be abstract. Given a type T, kparser(T) and parcont(T) are two types that are equivalent:
// extern castfn kparser_encode {a:t@ype}(parinp_nullify(a)): kparser(a) // extern castfn kparser_decode{a:t@ype}(kparser(a)): parinp_nullify(a) //Each value of the type kparser(T) represents a parser of CPS-style. For the sake of brevity, I will only state that such a parser parses some input into a value of the type T, omitting the part that the value obtained from parsing and the remaining input need to be passed to a continuation of the type parcont(T).
Generally speaking, parsing combinators are functions designed for constructing parsers. While a parsing combinator is often called on some existing parsers to construct a new parser, it is not uncommon to see one that takes as its arguments values that are not parsers. I present as follows some common parsing combinators.
Two combinators kparser_fail and kparser_just are given the following types:
// fun {a:t@ype} kparser_fail(): kparser(a) // fun {a:t@ype} kparser_just(x: a): kparser(a) //Calling kparser_fail simply raises an exception. Given a value, kparser_just simply returns the value without consuming any input.
Given a parser and a predicate, the combinator kparser_satisfy constructs another parser that only returns if the value returned by the given parser satisfies the predicate:
// fun {a:t@ype} kparser_satisfy ( kparser(a), test: cfun1(a, bool) ) : kparser(a) // end-of-fun //
Given two parsers, the combinator kparser_join2 constructs a new parser that essentially calls the two parsers consecutively and then combines the values returned by them to form a tuple:
// fun {a1 ,a2:t@ype} kparser_join2 ( kp1: kparser(a1), kp2: kparser(a2) ) : kparser(@(a1, a2)) // end-of-fun //
Given a parser and a function, the combinator kparser_fmap constructs another parser that returns a value obtained from applying the given function to the value returned by the given parser:
// fun {a:t@ype} {b:t@ype} kparser_fmap (kp: kparser(a), fmap: cfun1(a, b)) : kparser(b) //
The combinator kparser_fmap2 is essentially a composition of kparser_join2 and kparser_fmap:
// fun {a1 ,a2:t@ype} {b:t@ype} kparser_fmap2 (kp1: kparser(a1), kp2: kparser(a2), fmap: cfun1(a1, a2, b)): kparser(b) //
Given two parsers, the combinator kparser_orelse constructs a new parser that calls the first parser and then, if parsing fails, calls the second parser:
// fun {a:t@ype} kparser_orelse (kp1: kparser(a), kp2: kparser(a)): kparser(a) //If the second parser is called, then the input passed to it is the same as the input passed to the first one.
Given a parser, the combinator kparser_repeat0 constructs another one that applies the given parser repeatedly (until a parsing failure occurs):
// fun {a:t@ype} kparser_repeat0(kp: kparser(a)): kparser(List0(a)) //The values obtained from applying the given parser repeatedly are gathered in a list and then the list is returned. The combinator kparser_repeat1 is a slight variant of kparser_repeat0:
// fun {a:t@ype} kparser_repeat1(kp: kparser(a)): kparser(List1(a)) //Given a parser, kparser_repeat1 constructs another one that applies the given parser to obtain a value and then applies it repeatedly (until a parsing failure occurs).
Let us now see certain combinators desiged to construct parsers for parsing a sequence of characters (of the type char):
(* ****** ****** *) // fun{} kparser_char((*void*)): kparser(char) // (* ****** ****** *) // fun{} kparser_alpha((*void*)): kparser(char) fun{} kparser_alnum((*void*)): kparser(char) fun{} kparser_digit((*void*)): kparser(char) // (* ****** ****** *) // fun{} kparser_litchar(c0: char): kparser(char) fun{} kparser_literal(lit: string): kparser(int) // (* ****** ****** *)The combinator kparser_char cannot be implemented yet at this stage as the type parinp is abstract. The rest of the combinators can be implemented as follows:
(* ****** ****** *) implement {}(*tmp*) kparser_alpha ((*void*)) = let val kp = kparser_char<>() in kparser_satisfy(kp, lam(c) => isalpha(c)) end // end of [kparser_alpha] (* ****** ****** *) implement {}(*tmp*) kparser_alnum ((*void*)) = let val kp = kparser_char<>() in kparser_satisfy(kp, lam(c) => isalnum(c)) end // end of [kparser_alnum] (* ****** ****** *) implement {}(*tmp*) kparser_digit ((*void*)) = let val kp = kparser_char<>() in kparser_satisfy(kp, lam(c) => isdigit(c)) end // end of [kparser_digit] (* ****** ****** *) implement {}(*tmp*) kparser_litchar (c0) = let val kp = kparser_char<>() in kparser_satisfy<char>(kp, lam(c1) => c0 = c1) end // end of [kparser_litchar] (* ****** ****** *) implement {}(*tmp*) kparser_literal (lit) = let // val [n:int] lit = g1ofg0(lit) // val ncs = sz2i(string_length(lit)) // fun loop(i: natLte(n)): kparser(int) = if i < ncs then kparser_litchar(lit[i]) >> loop(i+1) else kparser_just(0) // end of [if] // in loop(0) end // end of [kparser_literal] (* ****** ****** *)Given a character ch, kparser_litchar(ch) returns a parser for recognizing the character. Given a string str, kparser_literal(str) returns a parser for recognizing the sequence of the characters contained in the string str.
For those interested in studying parsing combinators in more depth, the following links should be helpful:
In the rest of this article, I briefly mention the implementation of a tokenizer that turns a sequence of characters into tokens, showing a concrete case of parsing combinators in action.
Assume that parinp equals charlst (which is defined as List0(char)). That is, parsing input is just a list of characters. Then the combinator kparser_char can be implemented as follows:
// assume parinp_type = List0(char) // implement {}(*tmp*) kparser_char ((*void*)) = kparser_encode {char}( // lam(inp, kont) => ( case+ inp of | cons(c, inp2) => kont(c, inp2) | nil() => $raise ParFailExn() ) // ) (* kparser_char *) //
Assume that a legal integer token consists of a non-empty sequence of digits. Then a parser for integer tokens can be implement as follows:
// val kp_digit = kparser_digit<>() // val kp_int = kparser_fmap<charlst><int> ( kparser_repeat1(kp_digit), lam(cs) => charlst2int(cs) ) //
Assume that a legal identifier token consists of a letter followed by a (possibly empty) sequence of letters or digits. Then a parser for identifiers can be implement as follows:
// val kp_alpha = kparser_alpha<>() val kp_alnum = Kparser_alnum<>() // val kp_ident = // parsing identifiers kparser_fmap2<char,charlst><string> ( kp_alpha , kparser_repeat0(kp_alnum), lam(c, cs) => charlst2str(cons(c, cs)) ) //
The datatype token is declared for tokens as follows:
// datatype token = | TOKeof of () // the end | TOKint of int // integer | TOKident of string // identifier | TOKspchr of char // special char | TOKcomment of () // comment //The parser kp_token for turning sequences of characters into tokens is implemented below:
// val kp_TOKeof = kparser_fmap(kp_eof, lam(x) => TOKeof()) // val kp_TOKint = kparser_fmap(kp_int, lam(x) => TOKint(x)) // val kp_TOKident = kparser_fmap(kp_ident, lam(x) => TOKident(x)) // val kp_TOKspchr = kparser_fmap(kp_spchr, lam(x) => TOKspchr(x)) // val kp_TOKcomment = kparser_fmap(kp_comment, lam(x) => TOKcomment()) // val kp_token = kp_TOKint || kp_TOKident || kp_TOKcomment || kp_TOKspchr || kp_TOKeof //Note that kp_eof, kp_spchr and kp_comment are not presented here. A comment consisits of a sequence of characters starting with (* and ending with *), and embedded comments are allowed. I suggest that the reader pay close attention to the implementation of kp_comment in tokenizer.dats, which is not based on parsing combinators. Instead, kp_comment is implemented in a direct style. Parsing combinators can often help implement elegant parsers but not always. There is really no point sticking to parsing combinators if one can find a simpler and/or cleaner solution otherwise.
A tokenizer is implemented as follows that turns a list of characters into a (lazy) stream of tokens:
// assume parout_type = stream_con(token) // fun tokenizer ( cs: List0(char) ) : stream(token) = let // val kp_token = kparser_decode(kp_token) // in // $delay ( // case+ cs of | nil() => stream_nil() | cons _ => kp_token(cs, lam(tok, inp) => stream_cons(tok, tokenizer(inp))) // ) : stream_con(token) // end // end of [tokenizer] //
Please find the entirety of the code for this example in the file tokenizer.dats. A good exercise to solidify one's understanding of parsing combinators of CPS-style is to simply implement a parser for arithmetic expressions based on the tokenizer presented here.
This article is written by Hongwei Xi.