deliverables/report/pars.tex

   1 \section{Lexing \& parsing}\label{sec:pars}
   2 \subsection{\Yard}
   3 For both lexing and parsing we use proof of concept state of the art minimal
   4 parser combinators library called \Yard. \Yard{} is inspired by the well-known
   5 \textsc{Parsec} library. Where \Yard{} only has 9 parser combinators
   6 \textsc{Parsec} already has 9 categories of parser combinators with in every
   7 category there are numerous combinators.
   8
   9 \begin{lstlisting}[language=Clean]
  10 :: Error = PositionalError Int Int String | Error String
  11 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
  12
  13 (<?>) :: (Parser a b) Error -> Parser a b
  14 fail :: Parser a b
  15 top :: Parser a a
  16 peek :: Parser a a
  17 satisfy :: (a -> Bool) -> Parser a a
  18 check :: (a -> Bool) -> Parser a a
  19 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
  20 item :: a -> Parser a a | Eq a
  21 list :: [a] -> Parser a [a] | Eq a
  22 eof :: Parser a Void
  23 \end{lstlisting}
  24
  25 \subsection{Lexing \& Grammar}
  26 The existing \SPL{} grammar was not suitable for parsing and lexing. The
  27 grammar listed in Listing~\ref{lst:grammar} shows the current and final
  28 grammar. We have added several clauses that are explained in
  29 Section~\ref{sec:ext} such as string literals and higher order functions. The
  30 transformation of the grammar include eliminating left recursion, fixing
  31 operator priority and associativity.
  32
  33 To make the parsing more easy we first lex the character input into tokens. In
  34 this way the parser can be simplified a lot and some more complex constructs
  35 can be lexed as one token such as literal characters.
  36
  37 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
  38 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
  39 with a position and the \emph{ADT} used is show in Listing~\ref{lst:lextoken}.
  40 Parser combinators make it very easy to account for arbitrary white space and it
  41 is much less elegant to do this in a regular way. By choosing to lex with
  42 parser combinators the speed of the phase decreases. However, since the parsers
  43 are very easy this increase is very small.
  44
  45 \begin{lstlisting}[
  46         language=Clean,
  47         label={lst:lextoken},
  48         caption={Lexer token ADT}]
  49 :: Pos = {line :: Int, col :: Int}
  50 :: Token :== (Pos, TokenValue)
  51 :: TokenValue
  52         //Value tokens
  53         = IdentToken String
  54         | NumberToken Int
  55         | CharToken Char
  56         | StringToken [Char]
  57         //Keyword tokens
  58         | VarToken
  59         | ReturnToken
  60         | IfToken
  61         | ...
  62 \end{lstlisting}
  63
  64 A lexer written with parser combinators is pretty straight forward and in
  65 practice it is just a very long list of \texttt{alternative} monadic
  66 combinators. However the order is very important since some tokens are very
  67 greedy. Almost all tokens can be lexed trivially with the \CI{list} combinator
  68 for matching literal strings. Comments and literals are the only exception that
  69 require a non trivial parser.
  70
  71 \subsection{Parsing}
  72 The parsing phase is the second phase of the parser and is yet again a \Yard{}
  73 parser that transforms a list of tokens to an Abstract Syntax Tree(\AST). The
  74 full abstract syntax tree is listed in Listing~\ref{lst:ast} which closely
  75 resembles the grammar.
  76
  77 The parser uses the standard \Yard{} combinators. For clarity and easy the
  78 parser closely resembles the grammar. Due to the modularity of combinators it
  79 was and is very easy to add functionality to the parser. The parser also
  80 handles some syntactic sugar(Section~\ref{ssec:synsugar}). For example the
  81 parser expands the literal lists and literal strings to the according list or
  82 string representation. Moreover the parser transforms the let expressions to
  83 real functions representing constant values.
  84
  85 As an example we show the parsing of a \CI{FunDecl} in
  86 Listing~\ref{lst:fundecl}. The \CI{FunDecl} is one of the most complex parsers
  87 and is composed of as complex subparsers. A \CI{FunDecl} parser exactly one
  88 complete function.
  89
  90 First we do the non consuming \CI{peekPos} to make sure the positionality of
  91 the function is guaranteed in the \AST{}, after that the identifier is parsed
  92 which is basically a parser that transforms an \CI{IdToken} into a \CI{String}.
  93
  94 \CI{parseBBraces} is a parser combinator built from the basic combinators that
  95 parses a parser when the contents are surrounded by braces. This is followed by
  96 a yet another helper function to parse comma separated items.
  97
  98 Since the next element of the \ADT{} is a \CI{Maybe} the next parser is
  99 combined with the optional combinator since the type of a function is optional
 100 and when not given it will be inferred.
 101
 102 A function consists of \emph{many} \CI{VarDecl}s and then \emph{many}
 103 \CI{Stmt}s. The word \emph{many} in \Yard{} means zero or more while the word
 104 \emph{some} means one or more. Thus we allow a function to be empty. The reason
 105 why \CI{flatten} is mapped on to the last argument of the \CI{liftM6} is
 106 because the \CI{parseStmt} returns possibly multiple \CI{Stmt}s. This is
 107 because in the parsing phase we expand the \CI{print} functions into multiple
 108 instances and thus it can be the case that from one \CI{Stmt} we actually get
 109 multiple.
 110
 111 \begin{lstlisting}[
 112         language=Clean,
 113         label={lst:fundecl},
 114         caption={Parsing a \CI{FunDecl}}]
 115 :: FunDecl = FunDecl Pos String [String] (Maybe Type) [VarDecl] [Stmt]
 116
 117 parseFunDecl :: Parser Token FunDecl
 118 parseFunDecl = liftM6 FunDecl
 119     (peekPos)
 120     (parseIdent)
 121         (parseBBraces $ parseSepList CommaToken parseIdent)
 122         (optional (satTok DoubleColonToken *> parseFunType))
 123         (satTok CBraceOpenToken *> many parseVarDecl)
 124         (flatten <$> (many parseStmt <* satTok CBraceCloseToken))
 125 \end{lstlisting}