\section{Requirements}
\subsection{Introduction}
As with almost every plan for an application, this application starts with a
set of requirements. Requirements are a set of goals in different categories
that define what the application has to be able to do. Traditionally they are
defined at the start of the project and are not expected to change much. In
the case of this project the requirements were a lot more flexible because
there was only one person doing the programming and there was a weekly meeting
to discuss the state of affairs and, most importantly, the required changes.
Because of this a lot of the initial requirements were dropped and some
requirements were added in the process. The list below shows the definitive
requirements as well as the suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property, for example
efficiency, portability or compatibility. To be able to refer to them later,
the requirements are given unique codes. For the definitive requirements a
verbose explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[I1:] The system should be able to crawl several source types.
		\begin{itemize}
			\item[I1a:] Fax/email.
			\item[I1b:] XML feeds.
			\item[I1c:] RSS feeds.
			\item[I1d:] Websites.
		\end{itemize}
	\item[I2:] Apply low level matching techniques on isolated data.
	\item[I3:] Insert data in the database.
	\item[I4:] The system should have a user interface to train crawlers
		that is usable by someone without a particular computer science
		background.
	\item[I5:] The system should be able to report to the employee when a
		source has been changed too much for successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement I2 is the only requirement that was dropped completely, due to
time constraints. The time limitation is partly because we chose to implement
certain other requirements, such as an interactive, intuitive user interface
around the core of the pattern extraction program. All definitive requirements
are listed below.
\begin{itemize}
	\item[F1:] The system should be able to crawl RSS feeds.

		This requirement is an adapted version of the compound
		requirements I1a-I1d. We limited the source types to crawl to
		strict RSS because of the time constraints of the project. Most
		sources require an entirely different strategy and therefore we
		could not easily combine them. An explanation of why we chose
		RSS feeds can be found in Section~\ref{sec:whyrss}.

	\item[F2:] Export the data to a strict XML feed.

		This requirement is an adapted version of requirement I3; this
		was also done to limit the scope. We chose not to interact
		directly with the database or the \textit{Temporum}. The
		application is, however, able to output XML data that is
		formatted following a strict XSD scheme, so that it is easy to
		import the data into the database or the \textit{Temporum} in
		an indirect way.
	\item[F3:] The system should have a user interface to create crawlers
		that is usable by someone without a particular computer science
		background.

		This requirement is formed from I4. Initially the user
		interface for adding and training crawlers was a web interface
		that was user friendly and usable by someone without a
		particular computer science background, as the requirement
		stated. However, in the first prototypes the control center
		that could test, edit and remove crawlers was a command line
		application and thus not very usable for a general audience.
		This combined requirement asks for a single control center that
		can do all previously described tasks with an interface that is
		usable without prior knowledge of computer science.
	\item[F4:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.

		This requirement was also present in the original requirements
		and has not changed. When the crawler fails to crawl a source,
		which can happen for any reason, a message is sent to the
		people using the program so that they can edit or remove the
		faulty crawler. Being able to update crawlers without the need
		of a programmer is essential in shortening the feedback loop
		explained in Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[O1:] Integrate in the existing system used by Hyperleap.
	\item[O2:] The system should work in a modular fashion, so that the
		program can be extended in the future.
\end{itemize}

\subsubsection{Definitive non-functional requirements}
\begin{itemize}
	\item[N1:] Work in a modular fashion, so that the program can be
		extended in the future.

		The modularity is very important so that the components can be
		easily extended and new components can be added. Possible
		extensions are discussed in Section~\ref{sec:discuss}.
	\item[N2:] Operate standalone on a server.

		Non-functional requirement O1 was dropped because we want to
		keep the program as modular as possible; via an XML interface
		we still have a very close connection with the database without
		having to maintain a direct connection. The downside of an
		indirect connection instead of a direct connection is that the
		specification is much more rigid: if the specification changes,
		the backend program has to change as well.
\end{itemize}

\section{Application overview}
The workflow of the application can be divided into several components or
steps. The overview of the application is visible in Figure~\ref{appoverview}.
The nodes are applications or processing steps and the arrows denote
information flow or movement between nodes.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{appoverview.pdf}
	\caption{Overview of the application\label{appoverview}}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend system and
allows the user to interact with the backend. The frontend consists of a basic
graphical user interface which is shown in Figure~\ref{frontendfront}. As the
interface shows, there are three main components that the user can use. There
is also a button for downloading the XML\@. The \textit{Get xml} button is a
quick shortcut that makes the backend generate XML\@. This button is located
there only for diagnostic purposes and is not used in the standard workflow.
In the standard workflow the server periodically calls the XML output option
of the backend's command line interface to process the data.

\begin{figure}[H]
	\includegraphics[width=\linewidth]{frontendfront.pdf}
	\caption{The landing page of the frontend\label{frontendfront}}
\end{figure}

\subsubsection{Repair/Remove crawler}
This component lets the user view the crawlers and remove crawlers from the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from one dropdown menu, selecting the operation from the
other dropdown menu and pressing \textit{Submit}.

Removing a crawler removes it completely from the crawler database and the
crawler is unrecoverable. Editing a crawler opens a screen similar to the one
for adding a crawler. The details of that screen are discussed in
Section~\ref{addcrawler}. The only difference is that the previously trained
patterns are already visible in the training interface and can thus be
adapted, for example to accommodate changes in the source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the intelligent part of the system, since it includes the graph
optimization algorithm that recognizes user specified patterns in new data.
First, the user must fill in the static form that is visible at the top of the
page. This form contains, for example, general information about the venue
together with some crawler specific values such as the crawling frequency.
After that the user can mark certain points in the table as belonging to a
category. Marking text is as easy as selecting the text and pressing the
corresponding button. The text visible in the table is a stripped down version
of the \texttt{title} and \texttt{summary} fields of the original RSS feed.
When the text is marked it is highlighted in the same color as the color of
the button text. The entirety of the user interface with a few sample markings
is shown in Figure~\ref{crawlerpattern}. After marking the categories the user
can preview the data or submit. Previewing runs the crawler on the RSS feed in
memory so that the user can revise the patterns if necessary. Submitting sends
the page to the backend to be processed. What happens internally after
submitting is explained in detail in Figure~\ref{appinternals} together with
the accompanying text.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{crawlerpattern.pdf}
	\caption{A view of the interface for specifying the pattern. Two %
		entries are already marked.\label{crawlerpattern}}
\end{figure}
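The table contents originate from the \texttt{title} and \texttt{summary}
fields of the RSS entries. As a rough illustration of that raw data, the
sketch below fetches a feed with the third-party \texttt{feedparser} library
and prints those two fields for every entry. The library and the URL are only
assumptions for this example and the actual crawler does not necessarily use
them; the fragment is merely meant to show the kind of data involved.
\begin{verbatim}
# Illustrative sketch only, not the actual backend code: fetch an RSS
# feed and print the fields that the training interface presents.
import feedparser

def fetch_entries(feed_url):
    """Return (title, summary) pairs for every entry in the feed."""
    feed = feedparser.parse(feed_url)
    return [(entry.get('title', ''), entry.get('summary', ''))
            for entry in feed.entries]

if __name__ == '__main__':
    # Placeholder URL; any RSS feed would do.
    for title, summary in fetch_entries('http://example.com/events.rss'):
        print title, '|', summary
\end{verbatim}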

\subsubsection{Test crawler}
The test crawler component is a very simple non-interactive component that
allows the user to verify whether a crawler functions properly without having
to access the database via the command line utilities. Via a dropdown menu the
user selects the crawler and when \textit{Submit} is pressed the backend
generates a results page that shows a small log of the crawl, a summary of the
results and, most importantly, the results themselves. In this way the user
can see at a glance whether the crawler functions properly. Humans are very
fast at detecting patterns and therefore the error checking goes very quickly.
Because the log of the crawl operation is shown, this page can also be used
for diagnostic information about the backend's crawling system. The logging is
in depth and also shows possible exceptions, and is therefore also usable by
the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module is embedded in an Apache
HTTP server\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows \textit{Python}
code to handle HTTP requests, which lets us integrate neatly with the
\textit{Python} libraries. We chose \textit{Python} because of its rich set of
standard libraries and solid cross platform capabilities. We chose
\textit{Python} 2 specifically because it is still the default \textit{Python}
version on all major operating systems and remains supported until at least
the year $2020$. This means that the program can function for at least five
full years. Besides the main module that is embedded in the HTTP server, there
are some libraries and there is a standalone program that does the periodic
crawling.
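To sketch how \textit{mod\_python} hands HTTP requests to \textit{Python}
code, a minimal handler is shown below. The handler and its output are
hypothetical and much simpler than the actual main module; in the Apache
configuration such a handler is registered with a \texttt{PythonHandler}
directive.
\begin{verbatim}
# Minimal, hypothetical mod_python handler; it only shows how Apache
# passes a request object to Python code.
from mod_python import apache

def handler(req):
    # req is the Apache request object supplied by mod_python.
    req.content_type = 'text/html'
    req.write('<html><body>Crawler control center</body></html>')
    return apache.OK
\end{verbatim}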

\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends the patterns to the crawler.
The module serves the frontend in a modular fashion. For example, the buttons
and colors can be easily edited by a non-programmer by just changing the
appropriate values in a text file. In this way, even when conventions change,
the program can still function without the intervention of a programmer who
needs to adapt the source.
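As an example of such a text file, the hypothetical configuration below maps
category buttons to labels and colors and is read with the standard
\texttt{ConfigParser} module. The file name, sections and keys are made up for
the illustration; the real configuration of the program may look different.
\begin{verbatim}
# Hypothetical contents of frontend.cfg:
#
#   [buttons]
#   date = Date
#   location = Location
#
#   [colors]
#   date = #ff9900
#   location = #3366ff
#
# Reading the file with the Python 2 standard library:
import ConfigParser

config = ConfigParser.ConfigParser()
config.read('frontend.cfg')
for name, label in config.items('buttons'):
    print name, label, config.get('colors', name)
\end{verbatim}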

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of \textit{Python} scripts that, for
example, minimize the graphs, transform the user data into machine readable
data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program, also written in \textit{Python}, that is used by the
main module and technically is part of the libraries. The property in which
the crawler stands out is the fact that it can also run on its own. The
crawler has to be run periodically by a server to literally crawl the
websites. The main module communicates with the crawler when it is queried for
XML data, when a new crawler is added or when data is edited. The crawler also
offers a command line interface that has the same functionality as the web
interface of the control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed, so that the crawler knows
which ones are already present in the database and which ones are new. In this
way the crawler does not have to process the old entries when they appear in
the feed again. The RSS GUID could also have been used, but since it is an
optional value in the feed, not every feed uses it and it is therefore not
reliable. The crawler also has a function to export the database to XML
format. The XML output format is specified in an XSD\cite{Xsd} file for
minimal ambiguity.
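The duplicate detection described above can be summarized with the following
sketch. It is not the actual crawler code; it only assumes that an entry is
identified by a hash of its \texttt{title} and \texttt{summary} fields and
that the database behaves like a dictionary keyed on that hash.
\begin{verbatim}
# Sketch of hash-based duplicate detection, not the actual crawler code.
import hashlib

def entry_key(entry):
    """Identify an entry by hashing its title and summary."""
    digest = hashlib.sha1()
    digest.update(entry['title'].encode('utf-8'))
    digest.update(entry['summary'].encode('utf-8'))
    return digest.hexdigest()

def process_feed(entries, database):
    """Store only the entries that are not yet in the database (a dict)."""
    for entry in entries:
        key = entry_key(entry)
        if key in database:
            continue           # already crawled earlier, skip it
        database[key] = entry  # new entry: remember it for processing
\end{verbatim}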

\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be accompanied
by an XSD file that describes the format. An XSD file is in fact just another
XML file that describes the format of a class of XML files. Almost all
programming languages have an XML parser built in and therefore XML is a very
versatile format that makes the eventual import into the database very easy.
The most used languages also include XSD validation to detect errors in the
validity and completeness of XML files. This makes interfacing with the
database and possible future programs even easier. The XSD scheme used for the
output of this program can be found in the appendices in
Algorithm~\ref{scheme.xsd}. The XML output can be queried via the HTTP
interface, which calls the crawler backend to crunch the latest crawled data
into XML\@. It can also be acquired directly from the crawler's command line
interface.
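As an example of how a consumer of the output could use the XSD scheme, the
sketch below validates a generated XML file against \texttt{scheme.xsd} with
the third-party \texttt{lxml} library. The file names are illustrative and
this validation step is not part of the program itself.
\begin{verbatim}
# Illustrative only: check the exported XML against the XSD scheme
# before importing it further, using the third-party lxml library.
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')

if schema.validate(document):
    print 'output.xml conforms to scheme.xsd'
else:
    # error_log lists every violation found during validation
    print schema.error_log
\end{verbatim}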