\section{Requirements}
\subsection{Introduction}
Like almost every other computer program, this application starts with a set
of requirements. Requirements are a set of goals within different categories
that define what the application has to be able to do. Traditionally they are
defined at the start of the project and are not expected to change much. In
the case of this project the requirements were a lot more flexible because
there was only one person doing the programming and there was a weekly meeting
to discuss the matters and, most importantly, the required changes. Because of
this, several initial requirements were dropped and some requirements were
added during the process. The list below shows the definitive requirements as
well as the suspended requirements.

Two types of requirements are distinguished: functional and non-functional
requirements. Functional requirements describe a certain function, whereas
non-functional requirements describe a certain property such as efficiency or
compatibility. To be able to refer to them, the requirements are given unique
codes. For the active requirements the reason for the choice is also
specified.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
\item[F1:] Be able to crawl several source types.
\begin{itemize}
\item[F1a:] Fax/email.
\item[F1b:] XML feeds.
\item[F1c:] RSS feeds.
\item[F1d:] Websites.
\end{itemize}
\item[F2:] Apply low level matching techniques on isolated data.
\item[F3:] Insert the data in the database.
\item[F4:] User interface to train crawlers must be usable by non computer
science people.
\item[F5:] There must be a control center for the crawlers.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the only requirement that is dropped completely, because it
turned out to lie outside the scope of the project. This is mainly because we
chose to build an interactive, intuitive user interface around the core of the
pattern extraction program. All other requirements were changed or stayed the
same. The list below shows all definitive requirements, each with the title on
the first line and a description underneath.
\begin{itemize}
\item[F6:] Be able to crawl RSS feeds only.

This requirement is an adapted version of the compound requirements F1a-F1d.
We stripped down from crawling four different source types to only one because
of the scope of the project; most sources require an entirely different
strategy. The full reason why we chose RSS feeds can be found in
Section~\ref{sec:whyrss}.

\item[F7:] Export the data to a strict XML feed.

This requirement is an adapted version of requirement F3; this was done to
reduce the scope. We chose not to interact with the database or the
\textit{Temporum} directly. The application has to be able to output XML data
that is formatted following a strict XSD scheme so that it is easy to import
the data into the database or the \textit{Temporum}. A small sketch of such an
export step is given after this list.
\item[F8:] A control center interface that is usable by non computer
science people.

This requirement is a combination of F4 and F5. At first the user interface
for adding and training crawlers was a web interface that was user friendly
and usable by non computer science people, as the requirement stated. However,
in the first prototypes the control center that could test, edit and remove
crawlers was a command line application and thus not very usable for the
general audience. This combined requirement asks for a single control center
that can do all previously described tasks with an interface that is usable by
almost everyone.
\item[F9:] Report to the user or maintainer when a source has been changed
too much for successful crawling.

This requirement was also present in the original requirements and has not
changed. When the crawler fails to crawl a source, for whatever reason, a
message is sent to the people using the program so that they can edit or
remove the faulty crawler. This is a crucial component because it means that a
non computer science person can perform this task, which is essential in
shortening the feedback loop explained in Figure~\ref{fig:1.1.2}.
\end{itemize}
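
To give an impression of what requirement F7 amounts to in code, the sketch
below shows how a single crawled entry could be serialized to XML with
Python's built-in \texttt{xml.etree.ElementTree} module. The element and field
names are hypothetical placeholders; the real names are dictated by the XSD
scheme in the appendices.

\begin{verbatim}
# Hypothetical sketch of exporting one crawled entry to XML.
# Element and field names are placeholders; the real ones are
# dictated by the XSD scheme in the appendices.
import xml.etree.ElementTree as ET

entry = {'title': 'Concert', 'date': '2015-06-01', 'venue': 'Some venue'}

root = ET.Element('event')
for field, value in entry.items():
    ET.SubElement(root, field).text = value

print(ET.tostring(root))
\end{verbatim}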

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
\item[N1:] Integrate in the original system.
\item[N2:] Work in a modular fashion, thus be able to extend the program in
the future.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
\item[N2:] Work in a modular fashion, thus be able to extend the program in
the future.

Modularity is very important so that the existing components can easily be
extended and new components can be added. Possible extensions are discussed in
Section~\ref{sec:discuss}.
\item[N3:] Operate standalone on a server.

Non-functional requirement N1 is dropped because we want to keep the program
as modular as possible. Via an XML interface we still have a very stable
connection with the database, but we avoid getting entangled in the software
managing the database.
\end{itemize}

\section{Design}
\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface to the backend applications that allows the
user to interact with the backend, for example by adding crawlers. The
frontend consists of a basic graphical user interface that is shown in
Figure~\ref{frontendfront}. As the interface shows, there are three main
components that the user can use. There is also a button for downloading the
XML. This button is a quick shortcut to make the backend generate XML, but it
is located there only for diagnostic purposes and is not used in the standard
workflow. In the standard workflow the server periodically requests the XML
output from the backend and processes it.

\begin{figure}[H]
\caption{The landing page of the frontend}
\label{frontendfront}
\includegraphics[scale=0.75,natheight=160,natwidth=657]{frontendfront.png}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view, edit and remove the crawlers in the
database. Doing one of these things with a crawler is as simple as selecting
the crawler from one dropdown menu, selecting the operation from the other
dropdown menu and pressing \textit{Submit}. Removing a crawler deletes it
completely from the crawler database and the crawler will be unrecoverable.
Editing a crawler opens a screen similar to the one used for adding a crawler.
The details about that screen are discussed in Section~\ref{addcrawler}. The
only difference is that the previously trained patterns are already visible in
the training interface and can thus be adapted, for example to cope with
changes in the source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the smartest part of the whole system, as it includes the graph
optimization algorithm to recognize user specified patterns in the data. The
user has to assign a name to an RSS feed in the boxes and when the user
presses submit the RSS feed is downloaded and prepared to be shown in the
interactive editor. The editor consists of two components. The topmost
component allows the user to enter several fields of data concerning the
venue, such as the address, the crawling frequency and the website. Below
that, there is a table containing the processed RSS feed entries and a row of
buttons allowing the user to mark certain parts of the entries as certain
types. The user has to select a piece of an entry and then press the
appropriate category button. The text becomes highlighted and by doing this
for several entries the program has enough information to crawl the feed, as
shown in Figure~\ref{addcrawl}.
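
The exact internal representation of these marked entries is not fixed here;
the fragment below is a purely illustrative sketch, with hypothetical category
names and character offsets, of how a single marked entry could be stored
before it is generalized into a pattern.

\begin{verbatim}
# Purely illustrative: one RSS entry with user-marked spans.
# Category names and offsets are hypothetical placeholders.
entry_text = 'Jazz night at De Kade, 21:00, June 5th'

marked_spans = [
    {'category': 'what',  'start': 0,  'end': 10},  # 'Jazz night'
    {'category': 'where', 'start': 14, 'end': 21},  # 'De Kade'
    {'category': 'when',  'start': 23, 'end': 28},  # '21:00'
]

# The training step generalizes such examples into a pattern that
# also matches future entries of the same feed.
for span in marked_spans:
    print(entry_text[span['start']:span['end']] + ': ' + span['category'])
\end{verbatim}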

\begin{figure}
\caption{Example of a pattern in the add crawler component}
\label{addcrawl}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a very simple, non interactive component that
allows the user to verify whether a crawler functions properly without having
to access the database or the command line utilities. Via a dropdown menu the
user selects the crawler and when submit is pressed the backend generates a
results page that shows a small log of the crawl, a summary of the results
and, most importantly, the results themselves. In this way the user can see at
a glance whether the crawler functions properly. Humans are very fast at
detecting patterns and therefore this error checking can be done very quickly.
Because the log of the crawl operation is shown, this page can also be used
for diagnostic information about the backend's crawling system. The logging is
quite detailed and also shows possible exceptions, and is therefore also
usable for the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache webserver\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows the webserver
to execute Python code. We chose Python because of its rich set of standard
libraries and solid cross platform capabilities. We chose Python 2 because it
is still the default Python version on all major operating systems and stays
supported until at least the year 2020, meaning that the program can function
safely for at least five full years. Besides the main Python module that is
embedded in the webserver, there are some libraries and a standalone program
that does the periodic crawling.
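
As a minimal sketch of what such an embedded module looks like, the fragment
below shows a basic \textit{mod\_python} request handler. The actual main
module is far more elaborate; the response text here is only a placeholder.

\begin{verbatim}
# Minimal mod_python handler sketch; the real main module dispatches
# requests to the frontend and the crawler libraries instead.
from mod_python import apache

def handler(req):
    req.content_type = 'text/html'
    req.write('placeholder response from the embedded module')
    return apache.OK
\end{verbatim}

Such a handler is typically hooked into the webserver with a
\texttt{PythonHandler} directive in the Apache configuration.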

\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends them to the crawler. The
module serves the frontend in a modular fashion. For example, the buttons and
colors can easily be edited by a non programmer by just changing some values
in a text file. In this way, even when conventions change, the program can
still function without the intervention of a programmer adapting the source.
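
A sketch of how such a text file could be read with the \texttt{ConfigParser}
module from the Python 2 standard library is shown below; the file name,
section and keys are hypothetical.

\begin{verbatim}
# Hypothetical sketch: reading interface settings from a text file
# with the Python 2 standard library (file name and keys are made up).
import ConfigParser

config = ConfigParser.SafeConfigParser()
config.read('frontend.cfg')

# e.g. a [layout] section holding button labels and colors
submit_label = config.get('layout', 'submit_label')
background = config.get('layout', 'background_color')
\end{verbatim}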

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of Python scripts that, for example,
minimize the graphs, transform the user data into machine readable data,
export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and is technically
part of the libraries. The crawler stands out because it can also run on its
own. The crawler has to be run periodically by a server to actually crawl the
sources. The main module communicates with the crawler when it needs XML data,
when a new crawler is added or when data is edited. The crawler also offers a
command line interface that has the same functionality as the web interface of
the control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed so that the crawler knows which
ones are already present in the database and which ones are new, so that it
does not have to process all the old entries when they appear in the feed
again. The RSS GUID could also have been used, but since it is an optional
value in the feed, not every feed uses it and it is therefore not reliable.
The crawler also has a function to export the database to XML format. The XML
format is specified in an XSD\cite{Xsd} file for minimal ambiguity.
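
A minimal sketch of this hashing scheme is given below. It assumes the
database is an in-memory Python dictionary; persistence and the exact fields
that are hashed are left open here.

\begin{verbatim}
# Minimal sketch of duplicate detection by hashing feed entries.
# Which fields are hashed is an assumption; persistence is omitted.
import hashlib

database = {}  # hash -> processed entry

def entry_hash(entry):
    key = (entry['title'] + entry['description']).encode('utf-8')
    return hashlib.sha1(key).hexdigest()

def process_feed(entries):
    for entry in entries:
        h = entry_hash(entry)
        if h in database:
            continue          # already known, skip the old entry
        database[h] = entry   # new entry: store it for crawling
\end{verbatim}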

\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be accompanied
by an XSD file that describes the format. An XSD file is in fact just another
XML file that describes the format of a class of XML files. Because almost all
programming languages have an XML parser built in, it is a very versatile
format that makes importing into the database very easy. Most commonly used
languages also include XSD validation to check the validity and completeness
of XML files. This makes interfacing with the database and possible future
programs very easy. The XSD scheme used for this program's output can be found
in the appendices in Listing~\ref{scheme.xsd}. The XML output can be queried
via an HTTP interface that calls the crawler backend to crunch the latest
crawled data into XML.
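
As an illustration of this workflow, the sketch below fetches the XML output
over HTTP and validates it against the XSD scheme. The URL and file name are
placeholders, and the use of the third-party \texttt{lxml} library is an
assumption; any XSD-capable validator would do.

\begin{verbatim}
# Sketch: fetch the generated XML and validate it against the XSD.
# The URL, the file name and the use of lxml are assumptions.
import urllib2
from lxml import etree

xml_doc = etree.parse(urllib2.urlopen('http://example.org/crawler/xml'))
schema = etree.XMLSchema(etree.parse('scheme.xsd'))

if schema.validate(xml_doc):
    print('XML output conforms to the XSD scheme')
else:
    print(schema.error_log)
\end{verbatim}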