deliverables/report/pars.tex

   1 \section{Lexing \& parsing}\label{sec:pars}
   2 \subsection{\Yard}
   3 For both lexing and parsing we use proof of concept minimal parser combinators
   4 library called \Yard. \Yard{} is inspired by the well-known \textsc{Parsec}
   5 library. Where \Yard{} only has 7 primitive parser (combinators)
   6 \textsc{Parsec} already has 9 categories of parser combinators with in every
   7 category there are numerous combinators.
   8
   9 Parsers implemented in Yard parse left to right and are greedy, that is that
  10 they always try to consume as much input as possible, starting from the left
  11 most encountered token. \Yard's parsers also automatically apply backtracking,
  12 when the left parser combined with \CI{<|>} fails, then any input it might have
  13 consumed is restored and the right parser is executed.
  14
  15 \begin{lstlisting}[language=Clean]
  16 :: Error = PositionalError Int Int String | Error String
  17 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
  18
  19 (<?>) :: (Parser a b) Error -> Parser a b
  20 fail :: Parser a b
  21 top :: Parser a a
  22 peek :: Parser a a
  23 eof :: Parser a Void
  24 pure :: b -> Parser a b
  25 (>>=) :: (Parser a b) (b -> Parser a c) -> Parser a c
  26 (<|>) :: (Parer a b) (Parser a b) -> Parser a b
  27 --derived parser functions
  28 satisfy :: (a -> Bool) -> Parser a a
  29 check :: (a -> Bool) -> Parser a a
  30 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
  31 item :: a -> Parser a a | Eq a
  32 list :: [a] -> Parser a [a] | Eq a
  33 \end{lstlisting}
  34
  35 \subsection{Lexing \& Grammar}
  36 The existing \SPL{} grammar was not suitable for parsing and lexing. The
  37 grammar listed in Listing~\ref{lst:grammar} shows the current and final
  38 grammar. The grammar for \SPLC{} has been extended to accommodate for the
  39 syntactic sugar \SPLC{} introduces, such as string and list literals, and for
  40 anonymous lambda functions. Moreover we have applied transformation on the
  41 grammar including eliminating left recursion, fixing operator priority and
  42 associativity.
  43
  44 To make the parsing more easy we first lex the character input into tokens. In
  45 this way the parser can be simplified a lot and some more complex constructs
  46 can be lexed as one token such as literal characters.
  47
  48 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
  49 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
  50 with a position and the \emph{ADT} used is shown in Listing~\ref{lst:lextoken}.
  51 Parser combinators make it very easy to account for arbitrary white space and
  52 it is much less elegant to do this in a regular way.
  53 Listing~\ref{lst:lexerwhite} shows the way we handle white space and
  54 recalculate positions. By choosing to lex with parser combinators the speed of
  55 the phase decreases. However due to the simplicity of the lexer this is barely
  56 measurable.
  57
  58 \begin{lstlisting}[
  59         language=Clean,
  60         label={lst:lexerwhite},
  61         caption={Lexer whitespace handling}]
  62 lexProgram :: Int Int -> Parser Char [Token]
  63 lexProgram line column = lexToken >>= \t->case t of
  64         LexEOF = pure []
  65         LexNL = lexProgram (line+1) 1
  66         (LexSpace l c) = lexProgram (line+l) (column+c)
  67         (LexItemError e) = fail <?>
  68                 PositionalError line column ("LexerError: " +++ e)
  69         (LexToken c t) = lexProgram line (column+c)
  70                 >>= \rest->pure [({line=line,col=column}, t):rest]
  71 \end{lstlisting}
  72
  73 \begin{lstlisting}[
  74         language=Clean,
  75         label={lst:lextoken},
  76         caption={Lexer token ADT}]
  77 :: Pos = {line :: Int, col :: Int}
  78 :: Token :== (Pos, TokenValue)
  79 :: TokenValue
  80         //Value tokens
  81         = IdentToken String
  82         | NumberToken Int
  83         | CharToken Char
  84         | StringToken [Char]
  85         //Keyword tokens
  86         | VarToken
  87         | ReturnToken
  88         | IfToken
  89         | ...
  90 \end{lstlisting}
  91
  92 A lexer written with parser combinators is pretty straight forward and in
  93 practice it is just a very long list of \texttt{alternative} monadic
  94 combinators. However the order is very important since some tokens are very
  95 greedy. Almost all tokens can be lexed trivially with the \CI{list} combinator
  96 for matching literal strings. Comments and literals are the only exception that
  97 require a non trivial parser.
  98
  99 \subsection{Parsing}
 100 The parsing phase is the second phase of the parser and is yet again a \Yard{}
 101 parser that transforms a list of tokens to an \emph{Abstract Syntax
 102 Tree}(\AST). The full abstract syntax tree is listed in Listing~\ref{lst:ast}
 103 which closely resembles the grammar.
 104
 105 The parser uses the standard \Yard{} combinators. Due to the modularity of
 106 combinators it was and is very easy to add functionality to the parser. The
 107 parser also handles some syntactic sugar(Section~\ref{ssec:synsugar}). For
 108 example the parser expands the literal lists and literal strings to the
 109 according list or string representation. Moreover the parser transforms the let
 110 expressions to real functions representing constant values.
 111
 112 As an example we show the parsing of a \CI{FunDecl} in
 113 Listing~\ref{lst:fundecl}. The \CI{FunDecl} is one of the most complex parsers
 114 and is composed of as complex subparsers. A \CI{FunDecl} parser exactly one
 115 complete function.
 116
 117 First we do the non consuming \CI{peekPos} to get the position of the first
 118 token without consuming it, after that the identifier is parsed
 119 which is basically a parser that transforms an \CI{IdToken} into a \CI{String}.
 120
 121 \CI{parseBBraces} is a parser combinator built from the basic combinators that
 122 parses a parser when the contents are surrounded by braces. This is followed by
 123 a yet another helper function to parse comma separated items.
 124
 125 Since the next element of the \ADT{} is a \CI{Maybe} the next parser is
 126 combined with the optional combinator since the type of a function is optional
 127 and when not given it will be inferred later.
 128
 129 A function consists of \emph{many} \CI{VarDecl}s and then \emph{many}
 130 \CI{Stmt}s. The word \emph{many} in \Yard{} means zero or more while the word
 131 \emph{some} means one or more. Thus we allow a function to be empty. The reason
 132 why \CI{flatten} is mapped on to the last argument of the \CI{liftM6} is
 133 because the \CI{parseStmt} returns possibly multiple \CI{Stmt}s. This is
 134 because in the parsing phase we expand the \CI{print} functions into multiple
 135 instances and thus it can be the case that from one \CI{Stmt} we actually get
 136 multiple.
 137
 138 \begin{lstlisting}[
 139         language=Clean,
 140         label={lst:fundecl},
 141         caption={Parsing a \CI{FunDecl}}]
 142 :: FunDecl = FunDecl Pos String [String] (Maybe Type) [VarDecl] [Stmt]
 143
 144 parseFunDecl :: Parser Token FunDecl
 145 parseFunDecl = liftM6 FunDecl
 146         (peekPos)
 147         (parseIdent)
 148         (parseBBraces $ parseSepList CommaToken parseIdent)
 149         (optional (satTok DoubleColonToken *> parseFunType))
 150         (satTok CBraceOpenToken *> many parseVarDecl)
 151         (flatten <$> (many parseStmt <* satTok CBraceCloseToken))
 152 \end{lstlisting}
 153
 154 \subsection{Sugars} \label{ssec:synsugar}
 155 Some syntactic sugars are processed in the lexing or parsing phase. Entering
 156 literal lists and strings is a tedious job because you can only glue them
 157 together with cons operators. Thus we introduced a literal lists notation as
 158 visible in the grammar. Literal strings are lexed as single tokens and expanded
 159 during parsing in to normal lists of characters. Literal lists are processed in
 160 the parsing phase and expanded to their similar form with the \texttt{Cons}
 161 operator. An example of this is shown in Listing~\ref{lst:litlist}
 162
 163 \begin{lstlisting}[
 164         language=SPLCode,
 165         caption={Literal lists},
 166         label={lst:litlist}]
 167 //Before transformation
 168 var a = "Hello world!";
 169 var b = [3, 1, 4, 1, 5];
 170
 171 //After transformation
 172 var a = 'H':'e':'l':'l':'o':' ':'w':'o':'r':'l':'d':'!':[]
 173 var b = 3 : 1 : 4 : 1 : 5 : [];
 174 \end{lstlisting}
 175
 176 Another sugar added is variable arguments printing. To enable the user to
 177 pretty print data the number of arguments of the \SI{print} function is not
 178 fixed and it is expanded at time of compilation. With this sugar the programmer
 179 can print several different types in one line. Listing~\ref{lst:varprint} shows
 180 an example of the sugar. In the semantic analysis phase these statements are
 181 actually refined even more to represent the specific print function per type
 182 since the compiler allows printing of \SI{Int}, \SI{Bool}, \SI{Char} and
 183 \SI{[Char]}
 184 \begin{lstlisting}[
 185         language=SPLCode,
 186         caption={Variable arguments printing},
 187         label={lst:varprint}]
 188 //Before transformation
 189 print("abc", 42, 'c', "da\n");
 190
 191 //After transformation
 192 print('a':'b':'c':[]);
 193 print(42);
 194 print('c');
 195 print('d':'a':'\n':[]);
 196 \end{lstlisting}
 197
 198 The last sugar processed in the parsing or lexing phase are the \SI{let}
 199 constructions. These constructions provide an easy way of defining global
 200 constants without real constants. While we do not allow global variables one
 201 might want to introduce global constants and with these constructs this can be
 202 done very easily. Listing~\ref{lst:letgl} shows the expansion of \SI{let}
 203 declarations.
 204 \begin{lstlisting}[
 205         language=SPLCode,
 206         caption={\SI{let} expansion},
 207         label={lst:letgl}]
 208 //Before transformation
 209 let Int x = 42;
 210
 211 //After transformation
 212 x() :: -> Int{
 213         return 42;
 214 }
 215 \end{lstlisting}