8f290447080ce7639a5258a96f6d725aacbe49ae
[cc1516.git] / deliverables / report / pars.tex
1 \section{Lexing \& parsing}\label{sec:pars}
2 \subsection{\Yard}
3 For both lexing and parsing we use proof of concept
4 %state of the art
5 minimal
6 parser combinators library called \Yard. \Yard{} is inspired by the well-known
7 \textsc{Parsec} library. Where \Yard{} only has 7 primitive parser (combinators)
8 \textsc{Parsec} already has 9 categories of parser combinators with in every
9 category there are numerous combinators.
10
11 Parsers implemented in Yard parse left to right and are greedy, that is they
12 always try to consume as
13 much input as possible, starting from the left most encountered token.
14 Yards parsers also automatically apply backtracking,
15 when the left parser combined with \CI{<|>} fails, then any input it might
16 have consumed is restored and the right parser is executed.
17
18 \begin{lstlisting}[language=Clean]
19 :: Error = PositionalError Int Int String | Error String
20 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
21
22 (<?>) :: (Parser a b) Error -> Parser a b
23 fail :: Parser a b
24 top :: Parser a a
25 peek :: Parser a a
26 eof :: Parser a Void
27 pure :: b -> Parser a b
28 (>>=) :: (Parser a b) (b -> Parser a c) -> Parser a c
29 (<|>) :: (Parer a b) (Parser a b) -> Parser a b
30 --derived parser functions
31 satisfy :: (a -> Bool) -> Parser a a
32 check :: (a -> Bool) -> Parser a a
33 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
34 item :: a -> Parser a a | Eq a
35 list :: [a] -> Parser a [a] | Eq a
36 \end{lstlisting}
37
38 \subsection{Lexing \& Grammar}
39 The existing \SPL{} grammar was not suitable for parsing and lexing. The
40 grammar listed in Listing~\ref{lst:grammar} shows the current and final
41 grammar. The grammar for \SPLC{} has been extended to accommodate for the
42 syntactic sugar \SPLC{} introduces, such as string and list literals,
43 and for anonymous lambda functions.
44 Moreover we have applied transformation on the grammar including eliminating
45 left recursion, fixing operator priority and associativity.
46
47 To make the parsing more easy we first lex the character input into tokens. In
48 this way the parser can be simplified a lot and some more complex constructs
49 can be lexed as one token such as literal characters.
50
51 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
52 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
53 with a position and the \emph{ADT} used is shown in Listing~\ref{lst:lextoken}.
54 Parser combinators make it very easy to account for arbitrary white space and
55 it is much less elegant to do this in a regular way.
56 Listing~\ref{lst:lexerwhite} shows the way we handle white space and recalculate
57 positions. By choosing to lex with parser combinators the speed of the phase
58 decreases. However due to the simplicity of the lexer this is barely
59 measurable.
60
61 \begin{lstlisting}[
62 language=Clean,
63 label={lst:lexerwhite},
64 caption={Lexer whitespace handling}]
65 lexProgram :: Int Int -> Parser Char [Token]
66 lexProgram line column = lexToken >>= \t->case t of
67 LexEOF = pure []
68 LexNL = lexProgram (line+1) 1
69 (LexSpace l c) = lexProgram (line+l) (column+c)
70 (LexItemError e) = fail <?>
71 PositionalError line column ("LexerError: " +++ e)
72 (LexToken c t) = lexProgram line (column+c)
73 >>= \rest->pure [({line=line,col=column}, t):rest]
74 \end{lstlisting}
75
76 \begin{lstlisting}[
77 language=Clean,
78 label={lst:lextoken},
79 caption={Lexer token ADT}]
80 :: Pos = {line :: Int, col :: Int}
81 :: Token :== (Pos, TokenValue)
82 :: TokenValue
83 //Value tokens
84 = IdentToken String
85 | NumberToken Int
86 | CharToken Char
87 | StringToken [Char]
88 //Keyword tokens
89 | VarToken
90 | ReturnToken
91 | IfToken
92 | ...
93 \end{lstlisting}
94
95 A lexer written with parser combinators is pretty straight forward and in
96 practice it is just a very long list of \texttt{alternative} monadic
97 combinators. However the order is very important since some tokens are very
98 greedy. Almost all tokens can be lexed trivially with the \CI{list} combinator
99 for matching literal strings. Comments and literals are the only exception that
100 require a non trivial parser.
101
102 \subsection{Parsing}
103 The parsing phase is the second phase of the parser and is yet again a \Yard{}
104 parser that transforms a list of tokens to an Abstract Syntax Tree(\AST). The
105 full abstract syntax tree is listed in Listing~\ref{lst:ast} which closely
106 resembles the grammar.
107
108 The parser uses the standard \Yard{} combinators. Due to the modularity of
109 combinators it was and is very easy to add functionality to the parser. The
110 parser also handles some syntactic sugar(Section~\ref{ssec:synsugar}). For
111 example the parser expands the literal lists and literal strings to the
112 according list or string representation. Moreover the parser transforms the let
113 expressions to real functions representing constant values.
114
115 As an example we show the parsing of a \CI{FunDecl} in
116 Listing~\ref{lst:fundecl}. The \CI{FunDecl} is one of the most complex parsers
117 and is composed of as complex subparsers. A \CI{FunDecl} parser exactly one
118 complete function.
119
120 First we do the non consuming \CI{peekPos} to get the position of the first
121 token without consuming it, after that the identifier is parsed
122 which is basically a parser that transforms an \CI{IdToken} into a \CI{String}.
123
124 \CI{parseBBraces} is a parser combinator built from the basic combinators that
125 parses a parser when the contents are surrounded by braces. This is followed by
126 a yet another helper function to parse comma separated items.
127
128 Since the next element of the \ADT{} is a \CI{Maybe} the next parser is
129 combined with the optional combinator since the type of a function is optional
130 and when not given it will be inferred later.
131
132 A function consists of \emph{many} \CI{VarDecl}s and then \emph{many}
133 \CI{Stmt}s. The word \emph{many} in \Yard{} means zero or more while the word
134 \emph{some} means one or more. Thus we allow a function to be empty. The reason
135 why \CI{flatten} is mapped on to the last argument of the \CI{liftM6} is
136 because the \CI{parseStmt} returns possibly multiple \CI{Stmt}s. This is
137 because in the parsing phase we expand the \CI{print} functions into multiple
138 instances and thus it can be the case that from one \CI{Stmt} we actually get
139 multiple.
140
141 \begin{lstlisting}[
142 language=Clean,
143 label={lst:fundecl},
144 caption={Parsing a \CI{FunDecl}}]
145 :: FunDecl = FunDecl Pos String [String] (Maybe Type) [VarDecl] [Stmt]
146
147 parseFunDecl :: Parser Token FunDecl
148 parseFunDecl = liftM6 FunDecl
149 (peekPos)
150 (parseIdent)
151 (parseBBraces $ parseSepList CommaToken parseIdent)
152 (optional (satTok DoubleColonToken *> parseFunType))
153 (satTok CBraceOpenToken *> many parseVarDecl)
154 (flatten <$> (many parseStmt <* satTok CBraceCloseToken))
155 \end{lstlisting}
156
157 \subsection{Sugars} \label{ssec:synsugar}
158 Some syntactic sugars are procesessed in the lexing or parsing phase. Entering
159 literal lists and strings is a tedious job because you can only glue them
160 together with cons opererators. Thus we introduced a literal lists notation as
161 visible in the grammar. Literal strings are lexed as single tokens and expanded
162 during parsing in to normal lists of characters. Literal lists are processed in
163 the parsing phase and expanded to their similar form with the \texttt{Cons}
164 operator. An example of this is shown in Listing~\ref{lst:litlist}
165
166 \begin{lstlisting}[
167 language=SPLCode,
168 caption={Literal lists},
169 label={lst:litlist}]
170 //Before transformation
171 var a = "Hello world!";
172 var b = [3, 1, 4, 1, 5];
173
174 //After transformation
175 var a = 'H':'e':'l':'l':'o':' ':'w':'o':'r':'l':'d':'!':[]
176 var b = 3 : 1 : 4 : 1 : 5 : [];
177 \end{lstlisting}
178
179 Another sugar added is variable arguments printing. To enable the user to
180 pretty print data the number of arguments of the \SI{print} function is not
181 fixed and it is expanded at compiletime. With this sugar the programmer can
182 print several different types in one line. Listing~\ref{lst:varprint} shows an
183 example of the sugar. In the semantic analysis phase these statements are
184 actually refined even more to represent the specific print function per type
185 since the compiler allows printing of \SI{Int}, \SI{Bool}, \SI{Char} and
186 \SI{[Char]}
187 \begin{lstlisting}[
188 language=SPLCode,
189 caption={Variable arguments printing},
190 label={lst:varprint}]
191 //Before transformation
192 print("abc", 42, 'c', "da\n");
193
194 //After transformation
195 print('a':'b':'c':[]);
196 print(42);
197 print('c');
198 print('d':'a':'\n':[]);
199 \end{lstlisting}
200
201 The last sugar processed in the parsing or lexing phase are the \SI{let}
202 constructions. These constructions provide an easy way of defining global
203 constants without real constants. While we do not allow global variables one
204 might want to introduce global constants and with these constructs this can be
205 done very easily. Listing~\ref{lst:letgl} shows the expansion of \SI{let}
206 declarations.
207 \begin{lstlisting}[
208 language=SPLCode,
209 caption={\SI{let} expansion},
210 label={lst:letgl}]
211 //Before transformation
212 let Int x = 42;
213
214 //After transformation
215 x() :: -> Int{
216 return 42;
217 }
218 \end{lstlisting}