deliverables/report/pars.tex

   1 \section{Lexing \& parsing}\label{sec:pars}
   2 \subsection{\Yard}
   3 For both lexing and parsing we use proof of concept state of the art minimal
   4 parser combinators library called \Yard. \Yard{} is inspired by the well-known
   5 \textsc{Parsec} library. Where \Yard{} only has 9 parser combinators
   6 \textsc{Parsec} already has 9 categories of parser combinators with in every
   7 category there are numerous combinators.
   8
   9 \begin{lstlisting}[language=Clean]
  10 :: Error = PositionalError Int Int String | Error String
  11 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
  12
  13 (<?>) :: (Parser a b) Error -> Parser a b
  14 fail :: Parser a b
  15 top :: Parser a a
  16 peek :: Parser a a
  17 satisfy :: (a -> Bool) -> Parser a a
  18 check :: (a -> Bool) -> Parser a a
  19 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
  20 item :: a -> Parser a a | Eq a
  21 list :: [a] -> Parser a [a] | Eq a
  22 eof :: Parser a Void
  23 \end{lstlisting}
  24
  25 \subsection{Lexing \& Grammar}
  26 The existing \SPL{} grammar was not suitable for parsing and lexing. The
  27 grammar listed in Listing~\ref{lst:grammar} shows the current and final
  28 grammar. We have added several clauses that are explained in
  29 Section~\ref{sec:ext} such as string literals and higher order functions.
  30 Moreover we have applied transformation on the grammar including eliminating
  31 left recursion, fixing operator priority and associativity.
  32
  33 To make the parsing more easy we first lex the character input into tokens. In
  34 this way the parser can be simplified a lot and some more complex constructs
  35 can be lexed as one token such as literal characters.
  36
  37 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
  38 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
  39 with a position and the \emph{ADT} used is show in Listing~\ref{lst:lextoken}.
  40 Parser combinators make it very easy to account for arbitrary white space and
  41 it is much less elegant to do this in a regular way.
  42 Listing~\ref{lst:lexerwhite} shows the way we handle white space and recalculate
  43 positions. By choosing to lex with parser combinators the speed of the phase
  44 decreases. However due to the simplicity of the lexer this is barely
  45 measurable.
  46
  47 \begin{lstlisting}[
  48         language=Clean,
  49         label={lst:lexerwhite},
  50         caption={Lexer whitespace handling}]
  51 lexProgram :: Int Int -> Parser Char [Token]
  52 lexProgram line column = lexToken >>= \t->case t of
  53         LexEOF = pure []
  54         LexNL = lexProgram (line+1) 1
  55         (LexSpace l c) = lexProgram (line+l) (column+c)
  56         (LexItemError e) = fail <?>
  57                 PositionalError line column ("LexerError: " +++ e)
  58         (LexToken c t) = lexProgram line (column+c)
  59                 >>= \rest->pure [({line=line,col=column}, t):rest]
  60 \end{lstlisting}
  61
  62 \begin{lstlisting}[
  63         language=Clean,
  64         label={lst:lextoken},
  65         caption={Lexer token ADT}]
  66 :: Pos = {line :: Int, col :: Int}
  67 :: Token :== (Pos, TokenValue)
  68 :: TokenValue
  69         //Value tokens
  70         = IdentToken String
  71         | NumberToken Int
  72         | CharToken Char
  73         | StringToken [Char]
  74         //Keyword tokens
  75         | VarToken
  76         | ReturnToken
  77         | IfToken
  78         | ...
  79 \end{lstlisting}
  80
  81 A lexer written with parser combinators is pretty straight forward and in
  82 practice it is just a very long list of \texttt{alternative} monadic
  83 combinators. However the order is very important since some tokens are very
  84 greedy. Almost all tokens can be lexed trivially with the \CI{list} combinator
  85 for matching literal strings. Comments and literals are the only exception that
  86 require a non trivial parser.
  87
  88 \subsection{Parsing}
  89 The parsing phase is the second phase of the parser and is yet again a \Yard{}
  90 parser that transforms a list of tokens to an Abstract Syntax Tree(\AST). The
  91 full abstract syntax tree is listed in Listing~\ref{lst:ast} which closely
  92 resembles the grammar.
  93
  94 The parser uses the standard \Yard{} combinators. Due to the modularity of
  95 combinators it was and is very easy to add functionality to the parser. The
  96 parser also handles some syntactic sugar(Section~\ref{ssec:synsugar}). For
  97 example the parser expands the literal lists and literal strings to the
  98 according list or string representation. Moreover the parser transforms the let
  99 expressions to real functions representing constant values.
 100
 101 As an example we show the parsing of a \CI{FunDecl} in
 102 Listing~\ref{lst:fundecl}. The \CI{FunDecl} is one of the most complex parsers
 103 and is composed of as complex subparsers. A \CI{FunDecl} parser exactly one
 104 complete function.
 105
 106 First we do the non consuming \CI{peekPos} to make sure the position of
 107 the function is guaranteed in the \AST{}, after that the identifier is parsed
 108 which is basically a parser that transforms an \CI{IdToken} into a \CI{String}.
 109
 110 \CI{parseBBraces} is a parser combinator built from the basic combinators that
 111 parses a parser when the contents are surrounded by braces. This is followed by
 112 a yet another helper function to parse comma separated items.
 113
 114 Since the next element of the \ADT{} is a \CI{Maybe} the next parser is
 115 combined with the optional combinator since the type of a function is optional
 116 and when not given it will be inferred.
 117
 118 A function consists of \emph{many} \CI{VarDecl}s and then \emph{many}
 119 \CI{Stmt}s. The word \emph{many} in \Yard{} means zero or more while the word
 120 \emph{some} means one or more. Thus we allow a function to be empty. The reason
 121 why \CI{flatten} is mapped on to the last argument of the \CI{liftM6} is
 122 because the \CI{parseStmt} returns possibly multiple \CI{Stmt}s. This is
 123 because in the parsing phase we expand the \CI{print} functions into multiple
 124 instances and thus it can be the case that from one \CI{Stmt} we actually get
 125 multiple.
 126
 127 \begin{lstlisting}[
 128         language=Clean,
 129         label={lst:fundecl},
 130         caption={Parsing a \CI{FunDecl}}]
 131 :: FunDecl = FunDecl Pos String [String] (Maybe Type) [VarDecl] [Stmt]
 132
 133 parseFunDecl :: Parser Token FunDecl
 134 parseFunDecl = liftM6 FunDecl
 135     (peekPos)
 136     (parseIdent)
 137         (parseBBraces $ parseSepList CommaToken parseIdent)
 138         (optional (satTok DoubleColonToken *> parseFunType))
 139         (satTok CBraceOpenToken *> many parseVarDecl)
 140         (flatten <$> (many parseStmt <* satTok CBraceCloseToken))
 141 \end{lstlisting}
 142
 143 \subsection{Sugars}
 144 Some syntactic sugars are procesessed in the lexing or parsing phase. Entering
 145 literal lists and strings is a tedious job because you can only glue them
 146 together with cons opererators. Thus we introduced a literal lists notation as
 147 visible in the grammar. Literal strings are lex as single tokens and expanded
 148 during parsing in to normal lists of characters. Literal lists are processed in
 149 the parsing phase and expanded to their similar form with the \texttt{Cons}
 150 operator. An example of this is shown in Listing~\ref{lst:litlist}
 151
 152 \begin{lstlisting}[
 153         language=SPLCode,
 154         caption={Literal lists},
 155         label={lst:litlist}]
 156 //Before transformation
 157 var a = "Hello world!";
 158 var b = [3, 1, 4, 1, 5];
 159
 160 //After transformation
 161 var a = 'H':'e':'l':'l':'o':' ':'w':'o':'r':'l':'d':'!':[]
 162 var b = 3 : 1 : 4 : 1 : 5 : [];
 163 \end{lstlisting}
 164
 165 Another sugar added is variable arguments printing. To enable the user to
 166 pretty print data the number of arguments of the \SI{print} function is not
 167 fixed and it is expanded at compiletime. With this sugar the programmer can
 168 print several different types in one line. Listing~\ref{lst:varprint} shows an
 169 example of the sugar. In the semantic analysis phase these statements are
 170 actually refined even more to represent the specific print function per type
 171 since the compiler allows printing of \SI{Int}, \SI{Bool}, \SI{Char} and
 172 \SI{[Char]}
 173 \begin{lstlisting}[
 174         language=SPLCode,
 175         caption={Variable arguments printing},
 176         label={lst:varprint}]
 177 //Before transformation
 178 print("abc", 42, 'c', "da\n");
 179
 180 //After transformation
 181 print('a':'b':'c':[]);
 182 print(42);
 183 print('c');
 184 print('d':'a':'\n':[]);
 185 \end{lstlisting}
 186
 187 The last sugar processed in the parsing or lexing phase are the \SI{let}
 188 constructions. These constructions provide an easy way of defining global
 189 constants without real constants. While we do not allow global variables one
 190 might want to introduce global constants and with these constructs this can be
 191 done very easily. Listing~\ref{lst:letgl} shows the expansion of \SI{let}
 192 declarations.
 193 \begin{lstlisting}[
 194         language=SPLCode,
 195         caption={\SI{let} expansion},
 196         label={lst:letgl}]
 197 //Before transformation
 198 let Int x = 42;
 199
 200 //After transformation
 201 x() :: -> Int{
 202         return 42;
 203 }
 204 \end{lstlisting}