\section{Requirements}
\subsection{Introduction}
Like almost every plan for an application, this project starts with a set of
requirements. Requirements are goals in different categories that define what
the application has to be able to do. Traditionally they are defined at the
start of the project and are not expected to change much. In the case of this
project the requirements were much more flexible, because there was only one
person doing the programming and there was a weekly meeting to discuss
progress and, most importantly, the required changes. Because of this several
initial requirements were dropped and some requirements were added along the
way. The list below shows the definitive requirements as well as the
suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property, for example
efficiency, portability or compatibility. To be able to refer to them later
we give the requirements unique codes. For the definitive requirements a
verbose explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
\item[I1:] The system should be able to crawl several source types.
\begin{itemize}
\item[I1a:] Fax/email.
\item[I1b:] XML feeds.
\item[I1c:] RSS feeds.
\item[I1d:] Websites.
\end{itemize}
\item[I2:] Apply low level matching techniques on isolated data.
\item[I3:] Insert data into the database.
\item[I4:] The system should have a user interface to train crawlers that is
usable by someone without a particular computer science background.
\item[I5:] The system should be able to report to the user or
maintainer when a source has been changed too much for
successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement I2 is the only requirement that is dropped completely, due to
time constraints. This is partly because we chose to implement certain other
requirements, such as an interactive and intuitive user interface, around the
core of the pattern extraction program. Below are all definitive
requirements.
\begin{itemize}
\item[F1:] The system should be able to crawl RSS feeds.

This requirement is an adapted version of the compound
requirements I1a--I1d. We limited the source types to crawl to
strict RSS because of the time constraints of the project. Most
sources require an entirely different strategy and therefore we
could not easily combine them. An explanation of why we chose RSS
feeds can be found in Section~\ref{sec:whyrss}.

\item[F2:] Export the data to a strict XML feed.

This requirement is an adapted version of requirement I3; this
is also done to limit the scope. We chose not to interact
directly with the database or the \textit{Temporum}. The
application, however, is able to output XML data that is
formatted following a strict XSD scheme so that it is easy to
import the data into the database or \textit{Temporum} in an
indirect way.
\item[F3:] The system should have a user interface to create crawlers
that is usable by someone without a particular computer science
background.

This requirement is formed from I4. Initially the user
interface for adding and training crawlers was a
web interface that was user friendly and usable by someone
without a particular computer science background, as the
requirement stated. However, in the first prototypes the control
center that could test, edit and remove crawlers was a command
line application and thus not very usable for the general
audience. This combined requirement asks for a single control
center that can do all previously described tasks with an
interface that is usable without prior knowledge of computer
science.
\item[F4:] Report to the user or maintainer when a source has been
changed too much for successful crawling.

This requirement was also present in the original requirements
and has not changed. When the crawler fails to crawl a source,
for whatever reason, a message is sent to the people
using the program so that they can edit or remove the faulty
crawler. Updating without the need of a programmer is essential
in shortening the feedback loop explained in
Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
\item[O1:] Integrate with the existing system used by Hyperleap.
\item[O2:] The system should work in a modular fashion, so that the
program can be extended in the future.
\end{itemize}

\subsubsection{Definitive non-functional requirements}
\begin{itemize}
\item[N1:] Work in a modular fashion, so that the program can be
extended in the future.

The modularity is very important so that the components can
easily be extended and new components can be added. Possible
extensions are discussed in Section~\ref{sec:discuss}.
\item[N2:] Operate standalone on a server.

Non-functional requirement O1 is dropped because we want to
keep the program as modular as possible. Via an XML
interface we still have a very close connection with the
database without having to maintain a direct connection. The
downside of an indirect connection instead of a direct
connection is that the specification is much more rigid: if the
specification changes, the backend program has to change as
well.
\end{itemize}

\section{Application overview}
The workflow of the application can be divided into several components or
steps. An overview of the application is shown in Figure~\ref{appoverview}.
The nodes are applications or processing steps and the arrows denote
information flow or movement between nodes.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{appoverview.pdf}
\caption{Overview of the application}
\label{appoverview}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface connected to the backend system that allows
the user to interact with the backend. The frontend consists of a basic
graphical user interface, which is shown in Figure~\ref{frontendfront}. As
the interface shows, there are three main components that the user can use.
There is also a button for downloading the XML. The \textit{Get xml} button
is a quick shortcut that makes the backend generate XML and it is located
there for diagnostic purposes only. In the standard workflow the XML button
is not used; instead, the server periodically calls the XML output option
from the command line interface of the backend and processes the result.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{frontendfront.pdf}
\caption{The landing page of the frontend}
\label{frontendfront}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view, edit and remove the crawlers in the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from one dropdown menu, selecting the operation from
the other dropdown menu and pressing \textit{Submit}.

Removing a crawler will remove it completely from the crawler database and
the crawler will be unrecoverable. Editing a crawler will open a screen
similar to the one for adding a crawler. The details of that screen are
discussed in Section~\ref{addcrawler}. The only difference is that the
previously trained patterns are already visible in the training interface and
can thus be adapted, for example to adjust the crawler to changes in the
source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the intelligent part of the system, since it includes the graph
optimization algorithm to recognize user specified patterns in the new data.
First, the user must fill in the static form that is visible at the top of
the page. This form contains general information about the venue together
with some crawler specific values such as the crawling frequency. After that
the user can mark certain pieces of text in the table as belonging to a
category. Marking text is as easy as selecting the text and pressing the
corresponding button. The text visible in the table is a stripped down
version of the original RSS feed's \texttt{title} and \texttt{summary}
fields. When the text is marked it will be highlighted in the same color as
the button text. The entirety of the user interface with a few sample
markings is shown in Figure~\ref{crawlerpattern}. After marking the
categories the user can preview the data or submit. Previewing will run the
crawler on the RSS feed in memory so that the user can revise the patterns if
necessary. Submitting will send the page to the backend to be processed. The
internals of what happens after submitting are explained in detail in
Figure~\ref{appinternals} and the accompanying text.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{crawlerpattern.pdf}
\caption{A view of the interface for specifying the pattern. Two %
entries are already marked.}
\label{crawlerpattern}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a simple non-interactive component that allows
the user to verify whether a crawler functions properly without having to
access the database via the command line utilities. Via a dropdown menu the
user selects the crawler and when submit is pressed the backend generates a
results page that shows a small log of the crawl, a summary of the results
and, most importantly, the results themselves. In this way the user can see
at a glance whether the crawler functions properly. Humans are very fast at
detecting patterns and therefore the error checking is quick. Because the log
of the crawl operation is shown, this page can also be used to obtain
diagnostic information about the backend's crawling system. The logging is
quite in-depth and also shows possible exceptions, which makes it useful for
the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache HTTP server\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows Python code to
handle HTTP requests, which lets us integrate neatly with the \textit{Python}
libraries. We chose \textit{Python} because of its rich set of standard
libraries and solid cross platform capabilities. We specifically chose
\textit{Python} 2 because it is still the default \textit{Python} version on
all major operating systems and stays supported until at least the year
$2020$. This means that the program can function for at least 5 full years.
The application thus consists of a main \textit{Python} module embedded in
the HTTP server, a set of libraries and a standalone program that does the
periodic crawling.

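To illustrate how the main module is reachable over HTTP, the sketch below
shows a minimal \textit{mod\_python} request handler. The response body is a
hypothetical placeholder; the real main module dispatches the request to the
frontend and library code.
\begin{verbatim}
# Minimal mod_python handler sketch (Python 2). The response body is
# a placeholder; the actual main module builds the frontend pages.
from mod_python import apache

def handler(req):
    # mod_python calls this function for every request that Apache
    # routes to the module.
    req.content_type = 'text/html'
    req.write('<html><body>crawler control center</body></html>')
    return apache.OK
\end{verbatim}
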
\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends the patterns to the
crawler. The module serves the frontend in a modular fashion. For example,
the buttons and colors can easily be edited by a non-programmer by changing
the appropriate values in a text file. In this way, even when conventions
change, the program can still function without the intervention of a
programmer who needs to adapt the source.

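As a sketch of this idea, the interface values could be read from a plain
configuration file with the standard \textit{Python} 2 \texttt{ConfigParser}
module; the file name and keys below are hypothetical.
\begin{verbatim}
# Read button labels and colors from a text file so that a
# non-programmer can change them without touching the source.
import ConfigParser  # Python 2 standard library

config = ConfigParser.ConfigParser()
config.read('interface.cfg')

when_label = config.get('buttons', 'when')  # e.g. 'When'
when_color = config.get('colors', 'when')   # e.g. '#ffff00'
\end{verbatim}
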
\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of \textit{Python} scripts that,
for example, minimize the graphs, transform the user data into machine
readable data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program, also written in \textit{Python}, that is used by
the main module and technically is part of the libraries. What sets the
crawler apart is that it can also run on its own. The crawler has to be run
periodically by a server to actually crawl the sources. The main module
communicates with the crawler when it is queried for XML data, when a new
crawler is added or when data is edited. The crawler also offers a command
line interface that has the same functionality as the web interface of the
control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed so that the crawler knows
which ones are already present in the database and which ones are new. In
this way the crawler does not have to process the old entries again when they
reappear in the feed. The RSS GUID could also have been used, but since it is
an optional value not every feed provides it and it is therefore not reliable.
The crawler also has a function to export the database to XML format. The XML
output format is specified in an XSD\cite{Xsd} file for minimal ambiguity.

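The sketch below illustrates the deduplication idea: each entry is reduced to
a digest of its \texttt{title} and \texttt{summary}, and only entries whose
digest is not yet in the store are processed. The storage layout shown (a
plain dictionary) is an assumption made for the example.
\begin{verbatim}
# Deduplicate feed entries by hashing their content (sketch).
import hashlib

def entry_hash(title, summary):
    # A stable digest of the entry content identifies it across runs.
    data = (title + summary).encode('utf-8')
    return hashlib.md5(data).hexdigest()

def new_entries(entries, store):
    # Yield only the (title, summary) pairs not yet in the store and
    # record them so that the next run skips them.
    for title, summary in entries:
        key = entry_hash(title, summary)
        if key not in store:
            store[key] = (title, summary)
            yield title, summary
\end{verbatim}
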
\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be
accompanied by an XSD file that describes the format. An XSD file is in fact
just another XML file that describes the format of a class of XML files.
Almost all programming languages have an XML parser built in, which makes XML
a very versatile format and makes the eventual import into the database very
easy. The most used languages also include XSD validation to detect errors in
the validity and completeness of XML files. This makes interfacing with the
database and possible future programs even easier. The XSD scheme used for
this program's output can be found in the appendices in
Listing~\ref{scheme.xsd}. The XML output can be queried via the HTTP
interface that calls the crawler backend to crunch the latest crawled data
into XML. It can also be acquired directly from the crawler's command line
interface.
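As an example of such a validity check, the third-party \texttt{lxml} library
can validate the exported XML against the XSD scheme; the file names below
are hypothetical.
\begin{verbatim}
# Validate the exported XML against the XSD scheme (sketch, Python 2
# with the third-party lxml library).
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')

if schema.validate(document):
    print 'output conforms to the schema'
else:
    print schema.error_log
\end{verbatim}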