update'
[cc1516.git] / deliverables / report / pars.tex
1 \section{Lexing \& parsing}\label{sec:pars}
2 \subsection{\Yard}
3 For both lexing and parsing we use proof of concept state of the art minimal
4 parser combinators library called \Yard. \Yard{} is inspired by the well-known
5 \textsc{Parsec} library. Where \Yard{} only has 9 parser combinators
6 \textsc{Parsec} already has 9 categories of parser combinators with in every
7 category there are numerous combinators.
8
9 \begin{lstlisting}[language=Clean]
10 :: Error = PositionalError Int Int String | Error String
11 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
12
13 (<?>) :: (Parser a b) Error -> Parser a b
14 fail :: Parser a b
15 top :: Parser a a
16 peek :: Parser a a
17 satisfy :: (a -> Bool) -> Parser a a
18 check :: (a -> Bool) -> Parser a a
19 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
20 item :: a -> Parser a a | Eq a
21 list :: [a] -> Parser a [a] | Eq a
22 eof :: Parser a Void
23 \end{lstlisting}
24
25 \subsection{Lexing \& Grammar}
26 The existing \SPL{} grammar was not suitable for parsing and lexing. The
27 grammar listed in Listing~\ref{lst:grammar} shows the current and final
28 grammar. We have added several clauses that are explained in
29 Section~\ref{sec:ext} such as string literals and higher order functions.
30 Moreover we have applied transformation on the grammar including eliminating
31 left recursion, fixing operator priority and associativity.
32
33 To make the parsing more easy we first lex the character input into tokens. In
34 this way the parser can be simplified a lot and some more complex constructs
35 can be lexed as one token such as literal characters.
36
37 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
38 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
39 with a position and the \emph{ADT} used is show in Listing~\ref{lst:lextoken}.
40 Parser combinators make it very easy to account for arbitrary white space and
41 it is much less elegant to do this in a regular way.
42 Listing~\ref{lst:lexerwhite} shows the way we handle white space and recalculate
43 positions. By choosing to lex with parser combinators the speed of the phase
44 decreases. However due to the simplicity of the lexer this is barely
45 measurable.
46
47 \begin{lstlisting}[
48 language=Clean,
49 label={lst:lexerwhite},
50 caption={Lexer whitespace handling}]
51 lexProgram :: Int Int -> Parser Char [Token]
52 lexProgram line column = lexToken >>= \t->case t of
53 LexEOF = pure []
54 LexNL = lexProgram (line+1) 1
55 (LexSpace l c) = lexProgram (line+l) (column+c)
56 (LexItemError e) = fail <?>
57 PositionalError line column ("LexerError: " +++ e)
58 (LexToken c t) = lexProgram line (column+c)
59 >>= \rest->pure [({line=line,col=column}, t):rest]
60 \end{lstlisting}
61
62 \begin{lstlisting}[
63 language=Clean,
64 label={lst:lextoken},
65 caption={Lexer token ADT}]
66 :: Pos = {line :: Int, col :: Int}
67 :: Token :== (Pos, TokenValue)
68 :: TokenValue
69 //Value tokens
70 = IdentToken String
71 | NumberToken Int
72 | CharToken Char
73 | StringToken [Char]
74 //Keyword tokens
75 | VarToken
76 | ReturnToken
77 | IfToken
78 | ...
79 \end{lstlisting}
80
81 A lexer written with parser combinators is pretty straight forward and in
82 practice it is just a very long list of \texttt{alternative} monadic
83 combinators. However the order is very important since some tokens are very
84 greedy. Almost all tokens can be lexed trivially with the \CI{list} combinator
85 for matching literal strings. Comments and literals are the only exception that
86 require a non trivial parser.
87
88 \subsection{Parsing}
89 The parsing phase is the second phase of the parser and is yet again a \Yard{}
90 parser that transforms a list of tokens to an Abstract Syntax Tree(\AST). The
91 full abstract syntax tree is listed in Listing~\ref{lst:ast} which closely
92 resembles the grammar.
93
94 The parser uses the standard \Yard{} combinators. Due to the modularity of
95 combinators it was and is very easy to add functionality to the parser. The
96 parser also handles some syntactic sugar(Section~\ref{ssec:synsugar}). For
97 example the parser expands the literal lists and literal strings to the
98 according list or string representation. Moreover the parser transforms the let
99 expressions to real functions representing constant values.
100
101 As an example we show the parsing of a \CI{FunDecl} in
102 Listing~\ref{lst:fundecl}. The \CI{FunDecl} is one of the most complex parsers
103 and is composed of as complex subparsers. A \CI{FunDecl} parser exactly one
104 complete function.
105
106 First we do the non consuming \CI{peekPos} to make sure the position of
107 the function is guaranteed in the \AST{}, after that the identifier is parsed
108 which is basically a parser that transforms an \CI{IdToken} into a \CI{String}.
109
110 \CI{parseBBraces} is a parser combinator built from the basic combinators that
111 parses a parser when the contents are surrounded by braces. This is followed by
112 a yet another helper function to parse comma separated items.
113
114 Since the next element of the \ADT{} is a \CI{Maybe} the next parser is
115 combined with the optional combinator since the type of a function is optional
116 and when not given it will be inferred.
117
118 A function consists of \emph{many} \CI{VarDecl}s and then \emph{many}
119 \CI{Stmt}s. The word \emph{many} in \Yard{} means zero or more while the word
120 \emph{some} means one or more. Thus we allow a function to be empty. The reason
121 why \CI{flatten} is mapped on to the last argument of the \CI{liftM6} is
122 because the \CI{parseStmt} returns possibly multiple \CI{Stmt}s. This is
123 because in the parsing phase we expand the \CI{print} functions into multiple
124 instances and thus it can be the case that from one \CI{Stmt} we actually get
125 multiple.
126
127 \begin{lstlisting}[
128 language=Clean,
129 label={lst:fundecl},
130 caption={Parsing a \CI{FunDecl}}]
131 :: FunDecl = FunDecl Pos String [String] (Maybe Type) [VarDecl] [Stmt]
132
133 parseFunDecl :: Parser Token FunDecl
134 parseFunDecl = liftM6 FunDecl
135 (peekPos)
136 (parseIdent)
137 (parseBBraces $ parseSepList CommaToken parseIdent)
138 (optional (satTok DoubleColonToken *> parseFunType))
139 (satTok CBraceOpenToken *> many parseVarDecl)
140 (flatten <$> (many parseStmt <* satTok CBraceCloseToken))
141 \end{lstlisting}