deliverables/report/pars.tex

   1 \section{Lexing \& parsing}\label{sec:pars}
   2 \subsection{\Yard}
   3 For both lexing and parsing we use a minimal parser combinator
   4 library called \Yard. \Yard{} is inspired by the well-known \textsc{Parsec}
   5 library. Where \Yard{} only has 7 primitive parser(s) (combinators)
   6 \textsc{Parsec} already has 9 categories of parser combinators with numerous
   7 combinators in every category.
   8
   9 Parsers implemented in Yard parse left to right and are greedy, that is that
  10 they always try to consume as much input as possible, starting from the left
  11 most encountered token. \Yard's parsers also automatically apply backtracking,
  12 when the left parser combined with \CI{<|>} fails, then any input it might have
  13 consumed is restored and the right parser is executed.
  14
  15 \begin{lstlisting}[language=Clean]
  16 :: Error = PositionalError Int Int String | Error String
  17 :: Parser a b = Parser ([a] -> (Either Error b, [a]))
  18
  19 (<?>) :: (Parser a b) Error -> Parser a b
  20 fail :: Parser a b
  21 top :: Parser a a
  22 peek :: Parser a a
  23 eof :: Parser a Void
  24 pure :: b -> Parser a b
  25 (>>=) :: (Parser a b) (b -> Parser a c) -> Parser a c
  26 (<|>) :: (Parer a b) (Parser a b) -> Parser a b
  27 --derived parser functions
  28 satisfy :: (a -> Bool) -> Parser a a
  29 check :: (a -> Bool) -> Parser a a
  30 (until) infix 2 :: (Parser a b) (Parser a c) -> Parser a [b]
  31 item :: a -> Parser a a | Eq a
  32 list :: [a] -> Parser a [a] | Eq a
  33 \end{lstlisting}
  34
  35 \subsection{Lexing \& Grammar}
  36 The existing \SPL{} grammar was not suitable for parsing and lexing. The
  37 grammar listed in Listing~\ref{lst:grammar} shows the current and final
  38 grammar. The grammar for \SPLC{} has been extended to accommodate for the
  39 syntactic sugar \SPLC{} introduces, such as string and list literals, and for
  40 anonymous lambda functions. Moreover we have applied transformations on the
  41 grammar including eliminating left recursion, fixing operator priority and
  42 associativity.
  43
  44 To make the parsing easier we first lex the character input into tokens. In
  45 this way the parser can be simplified a lot and some more complex constructs
  46 can be lexed as one token such as literal characters.
  47
  48 As said, the lexer uses a \Yard{} parser. The parser takes a list of characters
  49 and returns a list of \CI{Token}s. A token is a \CI{TokenValue} accompanied
  50 with a position. An excerpt of the \emph{ADT} of tokens is shown in
  51 Listing~\ref{lst:lextoken}.
  52 The used parser combinators make it very easy to account for arbitrary white
  53 space and in a much more elegant way manipulating the input as a list of
  54 characters directly.
  55 Listing~\ref{lst:lexerwhite} shows the way we handle white space and
  56 recalculate positions. By choosing to lex with parser combinators the speed of
  57 the phase decreases. However due to the simplicity of the lexer this is barely
  58 measurable.
  59
  60 \begin{lstlisting}[
  61         language=Clean,
  62         label={lst:lexerwhite},
  63         caption={Lexer whitespace handling}]
  64 lexProgram :: Int Int -> Parser Char [Token]
  65 lexProgram line column = lexToken >>= \t->case t of
  66         LexEOF = pure []
  67         LexNL = lexProgram (line+1) 1
  68         (LexSpace l c) = lexProgram (line+l) (column+c)
  69         (LexItemError e) = fail <?>
  70                 PositionalError line column ("LexerError: " +++ e)
  71         (LexToken c t) = lexProgram line (column+c)
  72                 >>= \rest->pure [({line=line,col=column}, t):rest]
  73 \end{lstlisting}
  74
  75 \begin{lstlisting}[
  76         language=Clean,
  77         label={lst:lextoken},
  78         caption={Lexer token ADT}]
  79 :: Pos = {line :: Int, col :: Int}
  80 :: Token :== (Pos, TokenValue)
  81 :: TokenValue
  82         //Value tokens
  83         = IdentToken String
  84         | NumberToken Int
  85         | CharToken Char
  86         | StringToken [Char]
  87         //Keyword tokens
  88         | VarToken
  89         | ReturnToken
  90         | IfToken
  91         | ...
  92 \end{lstlisting}
  93
  94 A lexer written with parser combinators is pretty straight forward and in
  95 practice it is just a very long list of \texttt{alternative} monadic
  96 combinators. However the order is very important since some tokens are so
  97 greedy they might consume other tokens. Almost all tokens can be lexed trivially
  98 with the \CI{list} combinator
  99 for matching literal strings. Comments and literals are the only exception that
 100 require a non trivial parser.
 101
 102 \subsection{Parsing}
 103 The parsing phase is the second phase of the parser and is yet again a \Yard{}
 104 parser that transforms a list of tokens to an \emph{Abstract Syntax
 105 Tree} (\AST). The full abstract syntax tree is listed in Listing~\ref{lst:ast}
 106 and closely resembles the grammar.
 107
 108 The parser uses the standard \Yard{} combinators. Due to the modularity of
 109 combinators it was and is very easy to add functionality to the parser. The
 110 parser also handles some syntactic sugar(Section~\ref{ssec:synsugar}). For
 111 example the parser expands the literal lists and literal strings to the
 112 according list or string representation. Moreover the parser transforms the let
 113 expressions to real functions representing constant values.
 114
 115 As an example we show the parsing of a \CI{FunDecl} in
 116 Listing~\ref{lst:fundecl}. The \CI{FunDecl} is one of the most complex parsers
 117 and is composed of complex subparsers. A \CI{FunDecl} parser exactly one
 118 complete function declaration.
 119
 120 First we do the non consuming \CI{peekPos} to get the position of the first
 121 token without consuming it, after that the identifier is parsed
 122 which is basically a parser that transforms an \CI{IdToken} into a \CI{String}.
 123
 124 \CI{parseBBraces} is a parser combinator built from the basic combinators that
 125 parses a parser when the contents are surrounded by braces. This is followed by
 126 a yet another helper function to parse comma separated items.
 127
 128 Since the next element of the \ADT{} is a \CI{Maybe} the next parser is
 129 combined with the optional combinator since the type of a function is optional
 130 and when not given will be inferred later.
 131
 132 A function consists of \emph{many} \CI{VarDecl}s and then \emph{many}
 133 \CI{Stmt}s. The word \emph{many} in \Yard{} means zero or more while the word
 134 \emph{some} means one or more. Thus we allow a function to be empty. The reason
 135 why \CI{flatten} is mapped on to the last argument of the \CI{liftM6} is
 136 because the \CI{parseStmt} returns possibly multiple \CI{Stmt}s. This is
 137 because in the parsing phase we expand the \CI{print} functions into multiple
 138 instances and thus it can be the case that from one \CI{Stmt} we actually get
 139 multiple.
 140
 141 \begin{lstlisting}[
 142         language=Clean,
 143         label={lst:fundecl},
 144         caption={Parsing a \CI{FunDecl}}]
 145 :: FunDecl = FunDecl Pos String [String] (Maybe Type) [VarDecl] [Stmt]
 146
 147 parseFunDecl :: Parser Token FunDecl
 148 parseFunDecl = liftM6 FunDecl
 149         (peekPos) //position of FunDecl
 150         (parseIdent) //function name
 151         (parseBBraces $ parseSepList CommaToken parseIdent) //list of arguments
 152         (optional (satTok DoubleColonToken *> parseFunType)) //optional type
 153         (satTok CBraceOpenToken *> many parseVarDecl) //list of VarDecls
 154         (flatten <$> (many parseStmt <* satTok CBraceCloseToken)) //list of Stmts
 155 \end{lstlisting}
 156
 157 \subsection{Sugars} \label{ssec:synsugar}
 158 Some syntactic sugars are processed in the lexing or parsing phase. Entering
 159 literal lists and strings is a tedious job because you can only glue them
 160 together with cons operators. Thus we introduced a literal lists notation as
 161 visible in the grammar. Literal strings are lexed as single tokens and expanded
 162 during parsing in to normal lists of characters. Literal lists are processed in
 163 the parsing phase and expanded to their similar form with the \texttt{Cons}
 164 operator. An example of this is shown in Listing~\ref{lst:litlist}
 165
 166 \begin{lstlisting}[
 167         language=SPLCode,
 168         caption={Literal lists},
 169         label={lst:litlist}]
 170 //Before transformation
 171 var a = "Hello world!";
 172 var b = [3, 1, 4, 1, 5];
 173
 174 //After transformation
 175 var a = 'H':'e':'l':'l':'o':' ':'w':'o':'r':'l':'d':'!':[]
 176 var b = 3 : 1 : 4 : 1 : 5 : [];
 177 \end{lstlisting}
 178
 179 Another sugar added is variable arguments printing. To enable the user to
 180 pretty print data the number of arguments of the \SI{print} function is not
 181 fixed and it is expanded at time of compilation. With this sugar the programmer
 182 can print several different types in one line. Listing~\ref{lst:varprint} shows
 183 an example of the sugar. In the semantic analysis phase these statements are
 184 actually refined even more to represent the specific print function per type
 185 during the semantical analysis phase, since \SPLC{} allows printing of
 186 \SI{Int}, \SI{Bool}, \SI{Char} and \SI{[Char]}
 187 \begin{lstlisting}[
 188         language=SPLCode,
 189         caption={Variable arguments printing},
 190         label={lst:varprint}]
 191 //Before transformation
 192 print("abc", 42, 'c', "da\n");
 193
 194 //After transformation
 195 print('a':'b':'c':[]);
 196 print(42);
 197 print('c');
 198 print('d':'a':'\n':[]);
 199 \end{lstlisting}
 200
 201 The last sugar processed in the parsing phase are the \SI{let}
 202 constructions. These provide an easy way of defining global
 203 constants without real constants. While we do not allow global variables one
 204 might want to introduce global constants and with these constructs this can be
 205 done very easily. Listing~\ref{lst:letgl} shows the expansion of \SI{let}
 206 declarations.
 207 \begin{lstlisting}[
 208         language=SPLCode,
 209         caption={\SI{let} expansion},
 210         label={lst:letgl}]
 211 //Before transformation
 212 let Int x = 42;
 213
 214 //After transformation
 215 x() :: -> Int{
 216         return 42;
 217 }
 218 \end{lstlisting}