c7df61997ac23728138c22f47c32a65268972605
[cc1516.git] / deliverables / report / pars.tex
1 \section{Lexing \& parsing}\label{sec:pars}
2 \subsection{\Yard}
3 For both lexing and parsing we use proof of concept state of the art minimal
4 parser combinators library called \Yard. \Yard{} is inspired by the well-known
5 \textsc{Parsec} library. Where \Yard{} only has 9 parser combinators
6 \textsc{Parsec} already has 9 categories of parser combinators with in every
7 category there are numerous combinators.
8
9 \begin{lstlisting}[language=Clean]
10 :: Error = PositionalError Int Int String | Error String
11 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
12
13 (<?>) :: (Parser a b) Error -> Parser a b
14 fail :: Parser a b
15 top :: Parser a a
16 peek :: Parser a a
17 satisfy :: (a -> Bool) -> Parser a a
18 check :: (a -> Bool) -> Parser a a
19 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
20 item :: a -> Parser a a | Eq a
21 list :: [a] -> Parser a [a] | Eq a
22 eof :: Parser a Void
23 \end{lstlisting}
24
25 \subsection{Lexing \& Grammar}
26 The existing \SPL{} grammar was not suitable for parsing and lexing. The
27 grammar listed in Listing~\ref{lst:grammar} shows the current and final
28 grammar. We have added several clauses that are explained in
29 Section~\ref{sec:ext} such as string literals and higher order functions. The
30 transformation of the grammar include eliminating left recursion, fixing
31 operator priority and associativity.
32
33 To make the parsing more easy we first lex the character input into tokens. In
34 this way the parser can be simplified a lot and some more complex constructs
35 can be lexed as one token such as literal characters.
36
37 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
38 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
39 with a position and the \emph{ADT} used is show in Listing~\ref{lst:lextoken}.
40 Parser combinators make it very easy to account for arbitrary white space and it
41 is much less elegant to do this in a regular way. By choosing to lex with
42 parser combinators the speed of the phase decreases. However, since the parsers
43 are very easy this increase is very small.
44
45 \begin{lstlisting}[
46 language=Clean,
47 label={lst:lextoken},
48 caption={Lexer token ADT}]
49 :: Pos = {line :: Int, col :: Int}
50 :: Token :== (Pos, TokenValue)
51 :: TokenValue
52 //Value tokens
53 = IdentToken String
54 | NumberToken Int
55 | CharToken Char
56 | StringToken [Char]
57 //Keyword tokens
58 | VarToken
59 | ReturnToken
60 | IfToken
61 | ...
62 \end{lstlisting}
63
64 A lexer written with parser combinators is pretty straight forward and in
65 practice it is just a very long list of \texttt{alternative} monadic
66 combinators. However the order is very important since some tokens are very
67 greedy. Almost all tokens can be lexed trivially with the \CI{list} combinator
68 for matching literal strings. Comments and literals are the only exception that
69 require a non trivial parser.
70
71 \subsection{Parsing}
72 The parsing phase is the second phase of the parser and is yet again a \Yard{}
73 parser that transforms a list of tokens to an Abstract Syntax Tree(\AST). The
74 full abstract syntax tree is listed in Listing~\ref{lst:ast} which closely
75 resembles the grammar.
76
77 The parser uses the standard \Yard{} combinators. For clarity and easy the
78 parser closely resembles the grammar. Due to the modularity of combinators it
79 was and is very easy to add functionality to the parser. The parser also
80 handles some syntactic sugar(Section~\ref{ssec:synsugar}). For example the
81 parser expands the literal lists and literal strings to the according list or
82 string representation. Moreover the parser transforms the let expressions to
83 real functions representing constant values.