\section{Requirements}
\subsection{Introduction}
As almost every plan for an application starts with a set of requirements, so
does this application. Requirements are a set of goals within different
categories that define what the application has to be able to do. They are
traditionally defined at the start of the project and are not expected to
change much. In the case of this project the requirements were a lot more
flexible, because there was only one person doing the programming and there
was a weekly meeting to discuss the matters and, most importantly, the
required changes. Because of this a lot of the initial requirements were
removed and some requirements were added in the process. The list below shows
the definitive requirements as well as the suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property, for example
efficiency, portability or compatibility. To be able to refer to them later we
give the requirements unique codes. For the definitive requirements a verbose
explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[F1:] Be able to crawl several source types.
		\begin{itemize}
			\item[F1a:] Fax/email.
			\item[F1b:] XML feeds.
			\item[F1c:] RSS feeds.
			\item[F1d:] Websites.
		\end{itemize}
	\item[F2:] Apply low level matching techniques on isolated data.
	\item[F3:] Insert the data in the database.
	\item[F4:] User interface to train crawlers that is usable by someone
		without a particular computer science background.
	\item[F5:] Control center for the crawlers.
	\item[F6:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the only requirement that is dropped completely, because it
lies outside of the time available for the project. The time available is
partly reduced because we chose to implement certain other requirements, such
as an interactive and intuitive user interface around the core of the pattern
extraction program. All other requirements either changed or stayed the same.
Below are all definitive requirements, each with the title on the first line
and a description underneath.
\begin{itemize}
	\item[F7:] Be able to crawl RSS feeds.

		This requirement is an adapted version of the compound
		requirements F1a-F1d. We stripped down from crawling four
		different sources to only one source because of the scope of
		the project. Most sources require an entirely different
		strategy and therefore we could not easily combine them. The
		full reason why we chose RSS feeds can be found in
		Section~\ref{sec:whyrss}.

	\item[F8:] Export the data to a strict XML feed.

		This requirement is an adapted version of requirement F3; this
		is also done to make the scope smaller. We chose not to
		interact with the database or the \textit{Temporum} directly.
		The application however is able to output XML data that is
		formatted following a strict XSD scheme so that it is easy to
		import the data into the database or \textit{Temporum}.
	\item[F9:] User interface to train crawlers that is usable by someone
		without a particular computer science background.

		This requirement is a combination of F4 and F5. At first the
		user interface for adding and training crawlers was done via a
		webinterface that was user friendly and usable by someone
		without a particular computer science background, as the
		requirement stated. However, in the first prototypes the
		control center that could test, edit and remove crawlers was a
		command line application and thus not very usable for the
		general audience. This combined requirement asks for a single
		control center that can do all previously described tasks with
		an interface that is usable without prior knowledge of
		computer science.
	\item[F10:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.

		This requirement was also present in the original requirements
		and has not changed. When the crawler fails to crawl a source,
		which can be due to any reason, a message is sent to the
		people using the program so that they can edit or remove the
		faulty crawler. Updating without the need of a programmer is
		essential in shortening the feedback loop explained in
		Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[N1:] Integrate in the original system.
	\item[N2:] Work in a modular fashion, thus be able to, in the future,
		extend the program.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
	\item[N2:] Work in a modular fashion, thus be able to, in the future,
		extend the program.

		Modularity is very important so that the components can be
		easily extended and new components can be added. Possible
		extensions are discussed in Section~\ref{sec:discuss}.
	\item[N3:] Operate standalone on a server.

		Non-functional requirement N1 is dropped because we want to
		keep the program as modular as possible. Via an XML interface
		we still have a very close connection with the database
		without having to maintain a direct connection.
\end{itemize}

\section{Application overview}
The workflow of the application can be divided into several components or
steps. The overview of the application is visible in Figure~\ref{appoverview}.
The nodes are applications or processing steps and the arrows denote
information flow or movement between nodes.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{appoverview.pdf}
	\strut\\
	\caption{Overview of the application}
	\label{appoverview}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend applications
and allows the user to interact with the backend. The frontend consists of a
basic graphical user interface which is shown in Figure~\ref{frontendfront}. As
the interface shows, there are three main components that the user can use.
There is also a button for downloading the XML. This \textit{Get xml} button is
a quick shortcut to make the backend generate XML and is located there only for
diagnostic purposes. In the standard workflow the XML button is not used;
instead, the server periodically calls the XML output option from the command
line interface of the backend to process it.

\begin{figure}[H]
	\includegraphics[width=\linewidth]{frontendfront.pdf}
	\caption{The landing page of the frontend}
	\label{frontendfront}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view the crawlers and remove the crawlers from the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from the dropdown menu, selecting the operation from the
other dropdown menu and pressing \textit{Submit}.

Removing the crawler will remove the crawler completely from the crawler
database and the crawler will be unrecoverable. Editing the crawler will open a
similar screen as when adding the crawler. The details about that screen will
be discussed in Section~\ref{addcrawler}. The only difference is that the
previously trained patterns are already made visible in the training interface
and can thus be adapted, for example to adjust the crawler to changes in the
source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and it
is the smartest part of the whole system as it includes the graph optimization
algorithm to recognize user specified patterns in the new data. First, the user
must fill in the static form that is visible at the top of the page. This form
contains general information about the venue together with some crawler
specific values such as the crawling frequency. After that the user can mark
certain pieces of text in the table as belonging to a certain category. Marking
text is as easy as selecting the text and pressing the corresponding button.
The text visible in the table is a stripped down version of the original RSS
feed's \texttt{title} and \texttt{summary} fields. When the text is marked it
will be highlighted in the same color as the color of the button text. The
entirety of the user interface with a few sample markings is shown in
Figure~\ref{crawlerpattern}. After marking the categories the user can preview
the data or submit. Previewing will run the crawler on the RSS feed in memory
and the user can revise the patterns if necessary. Submitting will send the
page to the backend to be processed. The internals of what happens after
submitting are explained in detail in Figure~\ref{appinternals} together with
the text.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{crawlerpattern.pdf}
	\caption{A view of the interface for specifying the pattern. Two %
		entries are already marked.}
	\label{crawlerpattern}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a very simple non-interactive component that
allows the user to verify whether a crawler functions properly without having
to access the database via the command line utilities. Via a dropdown menu the
user selects the crawler and when submit is pressed the backend generates a
results page that shows a small log of the crawler, a summary of the results
and, most importantly, the results themselves. In this way the user can see at
a glance whether the crawler functions properly. Humans are very fast at
detecting patterns and therefore the error checking goes very fast. Because the
log of the crawl operation is shown, this page can also be used for diagnostic
information about the backend's crawling system. The logging is quite detailed
and also shows possible exceptions and is therefore also usable for the
developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache HTTP server\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows the server to
handle HTTP requests with \textit{Python} code and this allows us to integrate
neatly with the \textit{Python} libraries. We chose \textit{Python} because of
the rich set of standard libraries and solid cross platform capabilities. We
chose specifically for \textit{Python} 2 because it is still the default
\textit{Python} version on all major operating systems and stays supported
until at least the year $2020$. This means that the program can function for at
least 5 full years. The application consists of a main \textit{Python} module
that is embedded in the HTTP server. Finally there are some libraries and there
is a standalone program that does the periodic crawling.
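
To illustrate how the main module is embedded in the Apache HTTP server via
\textit{mod\_python}, a minimal sketch of a request handler is shown below. The
\texttt{handler} entry point is the standard \textit{mod\_python} convention;
the response body shown here is only a placeholder and not the actual frontend
code.
\begin{verbatim}
# Minimal mod_python handler sketch; the response is a placeholder.
from mod_python import apache

def handler(req):
    # Every HTTP request for the configured location ends up here,
    # where the main module can dispatch it to the frontend logic.
    req.content_type = 'text/html'
    req.write('<html><body>frontend page goes here</body></html>')
    return apache.OK
\end{verbatim}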

\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends the patterns to the crawler.
The module serves the frontend in a modular fashion. For example, the buttons
and colors can be easily edited by a non-programmer by simply changing the
appropriate values in a text file. In this way, even when conventions change,
the program can still function without the intervention of a programmer who
needs to adapt the source.
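
As a sketch of how such a non-programmer configuration could look, the snippet
below reads button labels and colors from a plain text file using the standard
\textit{Python} 2 \texttt{ConfigParser} module. The file name, section and
option names are assumptions for illustration; the actual main module may store
these values differently.
\begin{verbatim}
# Sketch: load UI labels and colors from a plain text file (Python 2).
# The file name and option names are illustrative assumptions.
import ConfigParser

config = ConfigParser.ConfigParser()
config.read('frontend.conf')

# frontend.conf could contain, for example:
#   [buttons]
#   when_label = When
#   when_color = #ff0000
when_label = config.get('buttons', 'when_label')
when_color = config.get('buttons', 'when_color')
\end{verbatim}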

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of \textit{Python} scripts that, for
example, minimize the graphs, transform the user data into machine readable
data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and technically is
part of the libraries. The crawler stands out in that it can also run on its
own. The crawler has to be run periodically by a server to actually crawl the
sources. The main module communicates with the crawler when it is queried for
XML data, when a new crawler is added or when data is edited. The crawler also
offers a command line interface that has the same functionality as the web
interface of the control center.
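
A minimal sketch of what such a command line interface could look like, using
the standard \texttt{argparse} module, is given below. The subcommand names are
assumptions used for illustration; the real interface mirrors the functionality
of the web control center.
\begin{verbatim}
# Sketch of the standalone crawler CLI; subcommand names are illustrative.
import argparse

def main():
    parser = argparse.ArgumentParser(description='Standalone crawler')
    sub = parser.add_subparsers(dest='command')
    sub.add_parser('crawl', help='crawl all sources once')
    sub.add_parser('list', help='list the stored crawlers')
    xml = sub.add_parser('xml', help='export the database as XML')
    xml.add_argument('--output', default='output.xml')
    args = parser.parse_args()
    # Dispatch to the crawler library here; only echo the command for now.
    print args.command

if __name__ == '__main__':
    main()
\end{verbatim}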

The crawler saves all the data in a database. The database is a simple
dictionary where all the entries are hashed so that the crawler knows which
ones are already present in the database and which ones are new. In this way
the crawler does not have to process all the old entries when they appear in
the feed. The RSS GUID could also have been used, but since it is an optional
value in the feed not every feed uses it and it is therefore not reliable. The
crawler also has a function to export the database to XML format. The XML
output format is specified in an XSD\cite{Xsd} file for minimal ambiguity.
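
The deduplication can be sketched as follows: every entry is hashed and the
hash is used as the dictionary key, so entries that are already present in the
database are skipped. The exact fields and hash function that the crawler uses
are assumptions in this sketch.
\begin{verbatim}
# Sketch: skip RSS entries that were already crawled, keyed by a hash.
import hashlib

def entry_hash(title, summary):
    # Hash the fields assumed to identify an entry (title + summary).
    data = title.encode('utf-8') + summary.encode('utf-8')
    return hashlib.md5(data).hexdigest()

def add_new_entries(database, entries):
    for entry in entries:
        key = entry_hash(entry['title'], entry['summary'])
        if key not in database:
            # Only unseen entries are processed and stored.
            database[key] = entry
    return database
\end{verbatim}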

\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be accompanied
by an XSD file that describes the format. An XSD file is in fact just another
XML file that describes the format of XML files. Almost all programming
languages have an XML parser built in and therefore it is a very versatile
format that makes the eventual import into the database very easy. The most
commonly used languages also include XSD validation to check the validity and
completeness of XML files. This makes interfacing with the database and
possible future programs even easier. The XSD scheme used for this program's
output can be found in the appendices in Listing~\ref{scheme.xsd}. The XML
output can be queried via the HTTP interface that calls the crawler backend to
crunch the latest crawled data into XML. It can also be acquired directly from
the crawler's command line interface.
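
As an illustration of how a consumer of the XML output could validate it
against the XSD scheme, the sketch below uses the third-party \texttt{lxml}
library; the file names are placeholders.
\begin{verbatim}
# Sketch: validate the exported XML against the XSD scheme with lxml.
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')

if schema.validate(document):
    print 'XML output conforms to the scheme'
else:
    # The error log lists which elements violate the scheme.
    print schema.error_log
\end{verbatim}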