\section{Requirements}
\subsection{Introduction}
As with almost every other computer program, this application starts with a
set of requirements. Requirements are a set of goals within different
categories that define what the application has to be able to do. They are
traditionally defined at the start of the project and are not expected to
change much. In the case of this project the requirements were a lot more
flexible, because there was only one person doing the programming and there
was a weekly meeting to discuss the state of affairs and, most importantly,
the required changes. Because of this a number of initial requirements were
removed and some requirements were added along the way. The list below shows
the definitive requirements as well as the suspended requirements.

The requirements are divided into two types: functional and non-functional
requirements. Functional requirements describe a certain function, whereas
non-functional requirements describe a certain property such as efficiency
or compatibility. To be able to refer to them, every requirement is given a
unique code. For the active requirements we also state the reason for the
choice.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
\item[F1:] Be able to crawl several source types.
\begin{itemize}
\item[F1a:] Fax/email.
\item[F1b:] XML feeds.
\item[F1c:] RSS feeds.
\item[F1d:] Websites.
\end{itemize}
\item[F2:] Apply low level matching techniques on isolated data.
\item[F3:] Insert the data in the database.
\item[F4:] The user interface to train crawlers must be usable by people
without a computer science background.
\item[F5:] There must be a control center for the crawlers.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the only requirement that has been dropped completely. All
other definitive requirements are derived from the original functional
requirements. Together they form the following list of definitive
requirements:
\begin{itemize}
\item[F6:] Be able to crawl RSS feeds only.

This requirement is an adapted version of the compound requirements
F1a-F1d. We reduced the crawling of four different source types to just one
because of the limited scope of the project and because most source types
require an entirely different strategy. The full reason why we chose RSS
feeds can be found in Section~\ref{sec:whyrss}.

\item[F7:] Export the data to a strict XML feed.

This requirement is an adapted version of requirement F3; it was changed to
reduce the scope of the project. We chose not to interact directly with the
database or the \textit{Temporum}. Instead, the application has to be able
to output XML data formatted according to a strict XSD schema so that the
data can easily be imported into the database or the \textit{Temporum}.
\item[F8:] A control center interface that is usable by people without a
computer science background.

This requirement is a combination of F4 and F5. At first the user interface
for adding and training crawlers was a web interface that was user friendly
and usable by people without a computer science background, as the
requirement stated. However, in the first prototypes the control center
that could test, edit and remove crawlers was a command line application
and therefore not very usable for the general audience. This combined
requirement asks for a single control center that can perform all the
previously described tasks via an interface that is usable by almost
everyone.
\item[F9:] Report to the user or maintainer when a source has changed too
much for successful crawling.

This requirement was also present in the original requirements and has not
changed. When the crawler fails to crawl a source, for whatever reason, a
message is sent to the people using the program so that they can edit or
remove the faulty crawler. This is a crucial component because it allows a
person without a computer science background to perform this task, which is
essential in shortening the feedback loop explained in
Figure~\ref{fig:1.1.2}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
\item[N1:] Integrate in the original system.
\item[N2:] Work in a modular fashion so that the program can easily be
extended in the future.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
\item[N2:] Work in a modular fashion so that the program can easily be
extended in the future.

Modularity is very important so that the components can easily be extended
and new components can be added. Possible extensions are discussed in
Section~\ref{sec:discuss}.
\item[N3:] Operate standalone on a server.

Non-functional requirement N1 is dropped because we want to keep the
program as modular as possible. Via an XML interface we still have a very
stable connection with the database, but we avoid getting entangled in the
software managing the database.
\end{itemize}

\section{Design}
\subsection{Frontend}
We explain the design of the frontend application through examples and use
cases. In this way we can explain certain design choices visually and more
specifically.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\footnote{\url{https://www.python.org/}}. The main module can
be, and in practice is, embedded in an Apache
webserver\footnote{\url{https://httpd.apache.org/}} via the
\textit{mod\_python} Apache module\footnote{\url{http://modpython.org/}}.
The \textit{mod\_python} module allows the webserver to execute
\textit{Python} code directly. We chose Python because of its rich set of
standard libraries and its solid cross platform capabilities. We chose
Python 2 because it is still the default Python version on all major
operating systems and remains supported until at least the year 2020,
meaning that the program can function safely for at least five full years.
The application thus consists of a main Python module embedded in the
webserver, a set of libraries and a standalone program that does the
periodic crawling.
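
To illustrate how this embedding works, the sketch below shows a minimal
\textit{mod\_python} request handler. It is not the actual main module of
this project; it only shows the general shape such a handler takes: the
webserver calls \texttt{handler} for every request and the function returns
an Apache status code.

\begin{verbatim}
from mod_python import apache

def handler(req):
    # mod_python hands the request object to this function; the real
    # main module would dispatch to the frontend and crawler logic here.
    req.content_type = 'text/html'
    req.write('<html><body>crawler control center</body></html>')
    return apache.OK
\end{verbatim}

Embedding the module this way means that no separate application server is
needed: Apache itself runs the Python code.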

\subsubsection{Main module}
The main module is the program that handles the requests, controls the
frontend, converts the data to patterns and sends these to the crawler. The
module serves the frontend in a modular fashion. For example, the buttons
and colors can easily be edited by a non-programmer by just changing some
values in a text file. In this way, even when conventions change, the
program can still function without the intervention of a programmer to
adapt the source code.
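
As an illustration, the sketch below shows how such values could be read
with the \texttt{ConfigParser} module from the Python 2 standard library.
The file name and the keys are hypothetical; the text above only states
that the values live in a plain text file.

\begin{verbatim}
import ConfigParser  # Python 2 standard library

config = ConfigParser.ConfigParser()
config.read('frontend.cfg')  # hypothetical settings file

# Hypothetical keys; any label or color the frontend uses could be
# looked up like this instead of being hard coded in the source.
button_label = config.get('frontend', 'add_button_label')
button_color = config.get('frontend', 'button_color')
\end{verbatim}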

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. The libraries are essentially a group of Python scripts that, for
example, minimize the graphs, transform the user data into machine readable
data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and is technically
part of the libraries. What sets the crawler apart is that it can also run
on its own. The crawler has to be run periodically on a server to actually
crawl the sources. The main module communicates with the crawler when it
needs XML data, when a new crawler is added or when data is edited. The
crawler also offers a command line interface that has the same
functionality as the web interface of the control center.
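
The sketch below gives an impression of what such a command line interface
could look like, using \texttt{argparse} from the Python standard library.
The subcommand names are assumptions based on the functionality described
above and are not the actual interface of the program.

\begin{verbatim}
import argparse

def build_parser():
    # The subcommands are assumed from the described functionality:
    # crawling, inspecting crawlers and exporting the data as XML.
    parser = argparse.ArgumentParser(description='standalone crawler')
    subparsers = parser.add_subparsers(dest='command')
    subparsers.add_parser('crawl', help='crawl all configured sources once')
    subparsers.add_parser('list', help='list the configured crawlers')
    export = subparsers.add_parser('export', help='export crawled data as XML')
    export.add_argument('--output', default='crawled.xml')
    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    print 'selected subcommand:', args.command
\end{verbatim}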

The crawler saves all the data in a database. The database is a simple
dictionary in which all entries are hashed, so that the crawler knows which
entries are already present in the database and which ones are new, and
therefore does not have to process old entries again when they reappear in
the feed. The crawler also has a function to export the database to XML
format. The XML format is specified in an XSD file for minimal ambiguity.
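
The sketch below illustrates the idea behind this hash based bookkeeping.
Which fields identify an entry is an assumption made for the example; the
only given is that entries are hashed so that previously seen entries can
be skipped.

\begin{verbatim}
import hashlib

def entry_hash(entry):
    # Which fields identify an entry is an assumption for this sketch.
    key = (entry['title'] + entry['summary']).encode('utf-8')
    return hashlib.md5(key).hexdigest()

def merge_new_entries(database, feed_entries):
    # The database is a simple dictionary keyed by hash, so an entry is
    # only stored and processed when its hash is not present yet.
    for entry in feed_entries:
        digest = entry_hash(entry)
        if digest not in database:
            database[digest] = entry
    return database
\end{verbatim}

Because only the hashes are compared, old entries that reappear in the feed
are skipped without any further processing.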