\section{Requirements}
\subsection{Introduction}
Almost every plan for an application starts with a set of requirements, and
this application is no exception. Requirements are a set of goals in different
categories that define what the application has to be able to do. They are
traditionally defined at the start of a project and are not expected to change
much. In the case of this project the requirements were a lot more flexible,
because there was only one person doing the programming and there was a weekly
meeting to discuss matters and, most importantly, to discuss the required
changes. Because of this a lot of the initial requirements were removed and
some requirements were added in the process. The list below shows the
definitive requirements as well as the suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property. Properties
are, for example, efficiency, portability or compatibility. To be able to
refer to them later, every requirement is given a unique code. For the
definitive requirements a verbose explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[F1:] Be able to crawl several source types.
		\begin{itemize}
			\item[F1a:] Fax/email.
			\item[F1b:] XML feeds.
			\item[F1c:] RSS feeds.
			\item[F1d:] Websites.
		\end{itemize}
	\item[F2:] Apply low level matching techniques on isolated data.
	\item[F3:] Insert the data in the database.
	\item[F4:] User interface to train crawlers that is usable by someone
		without a particular computer science background.
	\item[F5:] Control center for the crawlers.
	\item[F6:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the sole requirement that is dropped completely, because it
lies outside of the time available for the project. Part of the reason less
time was available is that we chose to implement certain other requirements,
such as an interactive, intuitive user interface around the core of the
pattern extraction program. All other requirements were either changed or kept
the same. Below are all the definitive requirements, each with its title on
the first line and a description underneath.
\begin{itemize}
	\item[F7:] Be able to crawl RSS feeds.

		This requirement is an adapted version of the compound
		requirements F1a-F1d. We stripped down from crawling four
		different source types to only one source type because of the
		scope of the project. Most sources require an entirely
		different strategy and therefore we could not easily combine
		them. The full reason why we chose RSS feeds can be found in
		Section~\ref{sec:whyrss}.

	\item[F8:] Export the data to a strict XML feed.

		This requirement is an adapted version of requirement F3; this
		is also done to make the scope smaller. We chose not to
		interact with the database or the \textit{Temporum}. The
		application however is able to output XML data that is
		formatted following a strict XSD scheme so that it is easy to
		import the data in the database or the \textit{Temporum}.
	\item[F9:] User interface to train crawlers that is usable by someone
		without a particular computer science background.

		This requirement is a combination of F4 and F5. At first the
		user interface for adding and training crawlers was done via a
		web interface that was user friendly and usable by someone
		without a particular computer science background, as the
		requirement stated. However, in the first prototypes the
		control center that could test, edit and remove crawlers was a
		command line application and thus not very usable for the
		general audience. This combined requirement asks for a single
		control center that can do all previously described tasks with
		an interface that is usable without prior knowledge of
		computer science.
	\item[F10:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.

		This requirement was also present in the original requirements
		and has not changed. When the crawler fails to crawl a source,
		which can happen for any reason, a message is sent to the
		people using the program so that they can edit or remove the
		faulty crawler. Updating without the need of a programmer is
		essential in shortening the feedback loop explained in
		Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[N1:] Integrate into the original system.
	\item[N2:] Work in a modular fashion, so that the program can be
		extended in the future.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
	\item[N2:] Work in a modular fashion, so that the program can be
		extended in the future.

		Modularity is very important so that existing components can
		easily be extended and new components can be added. Possible
		extensions are discussed in Section~\ref{sec:discuss}.
	\item[N3:] Operate standalone on a server.

		Non-functional requirement N1 is dropped because we want to
		keep the program as modular as possible; via an XML interface
		we still have a very close connection with the database,
		without having to maintain a direct connection.
\end{itemize}

\section{Application overview}
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{appoverview.eps}
	\strut\\
	\caption{Overview of the application}
	\label{appoverview}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend application
and allows the user to interact with the backend. The frontend consists of a
basic graphical user interface which is shown in Figure~\ref{frontendfront}.
As the interface shows, there are three main components that the user can use.
There is also a button for downloading the XML. The \textit{Get xml} button is
a quick shortcut to make the backend generate XML. The button for grabbing the
XML data is located there only for diagnostic purposes. In the standard
workflow the XML button is not used; instead, the server periodically calls
the XML output option from the command line interface of the backend to
process it.

\begin{figure}[H]
	\includegraphics[scale=0.75,natheight=160,natwidth=657]{frontendfront.png}
	\caption{The landing page of the frontend}
	\label{frontendfront}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view the crawlers and remove them from the
database. Doing one of these things with a crawler is as simple as selecting
the crawler from the dropdown menu, selecting the operation from the other
dropdown menu and pressing \textit{Submit}.
Removing a crawler will delete it completely from the crawler database and the
crawler will be unrecoverable. Editing the crawler will open a similar screen
as when adding the crawler. The details about that screen are discussed in
Section~\ref{addcrawler}. The only difference is that the previously trained
patterns are already visible in the training interface and can thus be
adapted, for example to adjust the crawler to changes in the source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the smartest part of the whole system as it includes the graph
optimization algorithm to recognize user specified patterns in the data. The
user has to assign a name to an RSS feed in the boxes and when the user
presses submit the RSS feed is downloaded and prepared to be shown in the
interactive editor. The editor consists of two components. The topmost
component allows the user to enter several fields of data concerning the
venue, such as the address, the crawling frequency and the website. Below
there is a table containing the processed RSS feed entries and a row of
buttons allowing the user to mark certain parts of the entries as certain
types. The user has to select a piece of an entry and then press the
appropriate category button. The text will become highlighted and by doing
this for several entries the program will have enough information to crawl
the feed, as shown in Figure~\ref{addcrawl}.

\begin{figure}[H]
	\includegraphics[width=0.7\linewidth,natheight=1298,natwidth=584]{crawlerpattern.png}
	\caption{A pattern selection of three entries}
	\label{addcrawl}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a very simple non-interactive component that
allows the user to verify that a crawler functions properly without having to
access the database or the command line utilities. Via a dropdown menu the
user selects the crawler and when \textit{Submit} is pressed the backend
generates a results page that shows a small log of the crawler, a summary of
the results and, most importantly, the results themselves. In this way the
user can see at a glance whether the crawler functions properly. Humans are
very fast at detecting patterns and therefore the error checking can be done
very quickly. Because the log of the crawl operation is shown, this page can
also be used for diagnostic information about the backend's crawling system.
The logging is quite detailed and also shows possible exceptions, and is
therefore also usable for the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache webserver\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The module \textit{mod\_python} allows the webserver
to execute Python code directly. We chose Python because of the rich set of
standard libraries and solid cross platform capabilities. We chose Python 2
because it is still the default Python version on all major operating systems
and stays supported until at least the year 2020, meaning that the program can
function safely for at least five full years. The application consists of a
main Python module that is embedded in the webserver. Finally there are some
libraries and there is a standalone program that does the periodic crawling.

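To illustrate how \textit{mod\_python} lets the webserver execute Python code,
the listing below sketches a minimal \textit{mod\_python} request handler. The
handler and the page it returns are purely illustrative and do not correspond
to the actual main module of this application.

\begin{lstlisting}[language=Python]
from mod_python import apache

# Minimal sketch of a mod_python request handler; the page content is
# a placeholder, not the real frontend served by the main module.
def handler(req):
    req.content_type = 'text/html'
    req.write('<html><body>crawler control center</body></html>')
    return apache.OK
\end{lstlisting}
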
\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends it to the crawler. The
module serves the frontend in a modular fashion. For example, the buttons and
colors can easily be edited by a non-programmer by just changing some values
in a text file. In this way, even when conventions change, the program can
still function without the intervention of a programmer to adapt the source.

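As a sketch of how such a text file could be read, the listing below uses the
standard \texttt{ConfigParser} module of Python~2. The file name and the
section and option names are hypothetical examples and not the actual
configuration keys used by the program.

\begin{lstlisting}[language=Python]
import ConfigParser  # named configparser in Python 3

# Hypothetical configuration file and keys, used only for illustration.
config = ConfigParser.ConfigParser()
config.read('frontend.cfg')

button_label = config.get('buttons', 'get_xml_label')
button_color = config.get('colors', 'button_background')
\end{lstlisting}
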
\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of Python scripts that, for example,
minimize the graphs, transform the user data into machine readable data,
export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and is technically
part of the libraries. What makes the crawler stand out is that it can also
run on its own. The crawler has to be run periodically by a server to actually
crawl the websites. The main module communicates with the crawler when it
needs XML data, when a new crawler is added or when data is edited. The
crawler also offers a command line interface that has the same functionality
as the web interface of the control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed, so that the crawler knows
which ones are already present in the database and which ones are new and it
does not have to process all the old entries when they appear in the feed
again. The RSS GUID could also have been used, but since it is an optional
value in the feed, not every feed uses it and it is therefore not reliable.
The crawler also has a function to export the database to XML format. The XML
format is specified in an XSD\cite{Xsd} file for minimal ambiguity.

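The listing below gives a minimal sketch of this deduplication idea. It
assumes that an entry is a dictionary with \texttt{title} and \texttt{summary}
fields; the field names, the hash function and the storage format are
illustrative assumptions and not necessarily those used by the crawler.

\begin{lstlisting}[language=Python]
import hashlib

def entry_hash(entry):
    """Hash the relevant fields of an entry into a stable key."""
    key = (entry['title'] + entry['summary']).encode('utf-8')
    return hashlib.sha1(key).hexdigest()

def merge_new_entries(database, feed_entries):
    """Store only the entries whose hash is not yet in the database."""
    for entry in feed_entries:
        digest = entry_hash(entry)
        if digest not in database:
            database[digest] = entry
    return database
\end{lstlisting}
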
\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be
accompanied by an XSD file that describes the format. An XSD file is in fact
just another XML file that describes the format of a class of XML files.
Because almost all programming languages have an XML parser built in, it is a
very versatile format that makes importing the data into the database very
easy. The most used languages also include XSD validation to check the
validity and completeness of XML files. This makes interfacing with the
database and possible future programs very easy. The XSD scheme used for this
program's output can be found in the appendices in Listing~\ref{scheme.xsd}.
The XML output can be queried via an HTTP interface that calls the crawler
backend to crunch the latest crawled data into XML.
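
As an illustration, the listing below sketches how the XML output could be
validated against the XSD scheme in Python, using the third party
\texttt{lxml} library. The file names are placeholders and \texttt{lxml} is
only one of several possible validators.

\begin{lstlisting}[language=Python]
from lxml import etree  # third party library, one possible validator

# Placeholder file names; the real scheme is given in the appendices.
schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('crawler_output.xml')

if schema.validate(document):
    print 'XML output conforms to the XSD scheme'
else:
    print schema.error_log
\end{lstlisting}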