\section{Requirements}
\subsection{Introduction}
Almost every plan for an application starts with a set of requirements, and
this application is no exception. Requirements are a set of goals in different
categories that define what the application has to be able to do. They are
traditionally defined at the start of a project and are not expected to change
much. In the case of this project the requirements were a lot more flexible,
because there was only one person doing the programming and there was a weekly
meeting to discuss matters and, most importantly, to discuss the required
changes. Because of this a lot of the initial requirements were removed and
some requirements were added in the process. The list below shows the
definitive requirements as well as the suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property. Properties
are, for example, efficiency, portability or compatibility. To be able to
refer to them later, every requirement is given a unique code. For the
definitive requirements a verbose explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[F1:] Be able to crawl several source types.
		\begin{itemize}
			\item[F1a:] Fax/email.
			\item[F1b:] XML feeds.
			\item[F1c:] RSS feeds.
			\item[F1d:] Websites.
		\end{itemize}
	\item[F2:] Apply low level matching techniques on isolated data.
	\item[F3:] Insert the data in the database.
	\item[F4:] User interface to train crawlers that is usable by someone
		without a particular computer science background.
	\item[F5:] Control center for the crawlers.
	\item[F6:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the sole requirement that is dropped completely, because it
lies outside of the time available for the project. Part of the reason less
time was available is that we chose to implement certain other requirements,
such as an interactive, intuitive user interface around the core of the
pattern extraction program. All other requirements were either changed or kept
the same. Below are all the definitive requirements, each with its title on
the first line and a description underneath.
\begin{itemize}
	\item[F7:] Be able to crawl RSS feeds.

		This requirement is an adapted version of the compound
		requirements F1a-F1d. We stripped down from crawling four
		different source types to only one source type because of the
		scope of the project. Most sources require an entirely
		different strategy and therefore we could not easily combine
		them. The full reason why we chose RSS feeds can be found in
		Section~\ref{sec:whyrss}.

	\item[F8:] Export the data to a strict XML feed.

		This requirement is an adapted version of requirement F3; this
		is also done to make the scope smaller. We chose not to
		interact with the database or the \textit{Temporum}. The
		application however is able to output XML data that is
		formatted following a strict XSD scheme so that it is easy to
		import the data in the database or the \textit{Temporum}.
	\item[F9:] User interface to train crawlers that is usable by someone
		without a particular computer science background.

		This requirement is a combination of F4 and F5. At first the
		user interface for adding and training crawlers was done via a
		web interface that was user friendly and usable by someone
		without a particular computer science background, as the
		requirement stated. However, in the first prototypes the
		control center that could test, edit and remove crawlers was a
		command line application and thus not very usable for the
		general audience. This combined requirement asks for a single
		control center that can do all previously described tasks with
		an interface that is usable without prior knowledge of
		computer science.
	\item[F10:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.

		This requirement was also present in the original requirements
		and has not changed. When the crawler fails to crawl a source,
		which can happen for any reason, a message is sent to the
		people using the program so that they can edit or remove the
		faulty crawler. Updating without the need of a programmer is
		essential in shortening the feedback loop explained in
		Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[N1:] Integrate into the original system.
	\item[N2:] Work in a modular fashion, so that the program can be
		extended in the future.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
	\item[N2:] Work in a modular fashion, so that the program can be
		extended in the future.

		Modularity is very important so that existing components can
		easily be extended and new components can be added. Possible
		extensions are discussed in Section~\ref{sec:discuss}.
	\item[N3:] Operate standalone on a server.

		Non-functional requirement N1 is dropped because we want to
		keep the program as modular as possible; via an XML interface
		we still have a very close connection with the database,
		without having to maintain a direct connection.
\end{itemize}

\section{Application overview}
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{appoverview.eps}
	\strut\\
	\caption{Overview of the application}
	\label{appoverview}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend application
and allows the user to interact with the backend. The frontend consists of a
basic graphical user interface which is shown in Figure~\ref{frontendfront}.
As the interface shows, there are three main components that the user can use.
There is also a button for downloading the XML. The \textit{Get xml} button is
a quick shortcut to make the backend generate XML. The button for grabbing the
XML data is located there only for diagnostic purposes. In the standard
workflow the XML button is not used; instead, the server periodically calls
the XML output option from the command line interface of the backend to
process it.

\begin{figure}[H]
	\includegraphics[scale=0.75,natheight=160,natwidth=657]{frontendfront.png}
	\caption{The landing page of the frontend}
	\label{frontendfront}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view the crawlers and remove them from the
database. Doing one of these things with a crawler is as simple as selecting
the crawler from the dropdown menu, selecting the operation from the other
dropdown menu and pressing \textit{Submit}.
Removing a crawler will delete it completely from the crawler database and the
crawler will be unrecoverable. Editing the crawler will open a similar screen
as when adding the crawler. The details about that screen are discussed in
Section~\ref{addcrawler}. The only difference is that the previously trained
patterns are already visible in the training interface and can thus be
adapted, for example to adjust the crawler to changes in the source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the smartest part of the whole system as it includes the graph
optimization algorithm to recognize user specified patterns in the data. The
user has to assign a name to an RSS feed in the boxes and when the user
presses submit the RSS feed is downloaded and prepared to be shown in the
interactive editor. The editor consists of two components. The topmost
component allows the user to enter several fields of data concerning the
venue, such as the address, the crawling frequency and the website. Below
there is a table containing the processed RSS feed entries and a row of
buttons allowing the user to mark certain parts of the entries as certain
types. The user has to select a piece of an entry and then press the
appropriate category button. The text will become highlighted and by doing
this for several entries the program will have enough information to crawl
the feed, as shown in Figure~\ref{addcrawl}.

\begin{figure}[H]
	\includegraphics[width=0.7\linewidth,natheight=1298,natwidth=584]{crawlerpattern.png}
	\caption{A pattern selection of three entries}
	\label{addcrawl}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a very simple non-interactive component that
allows the user to verify that a crawler functions properly without having to
access the database or the command line utilities. Via a dropdown menu the
user selects the crawler and when \textit{Submit} is pressed the backend
generates a results page that shows a small log of the crawler, a summary of
the results and, most importantly, the results themselves. In this way the
user can see at a glance whether the crawler functions properly. Humans are
very fast at detecting patterns and therefore the error checking can be done
very quickly. Because the log of the crawl operation is shown, this page can
also be used for diagnostic information about the backend's crawling system.
The logging is quite detailed and also shows possible exceptions, and is
therefore also usable for the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache webserver\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The module \textit{mod\_python} allows the webserver
to execute Python code directly. We chose Python because of the rich set of
standard libraries and solid cross platform capabilities. We chose Python 2
because it is still the default Python version on all major operating systems
and stays supported until at least the year 2020, meaning that the program can
function safely for at least five full years. The application consists of a
main Python module that is embedded in the webserver. Finally there are some
libraries and there is a standalone program that does the periodic crawling.

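To illustrate how \textit{mod\_python} lets the webserver execute Python code,
the listing below sketches a minimal \textit{mod\_python} request handler. The
handler and the page it returns are purely illustrative and do not correspond
to the actual main module of this application.

\begin{lstlisting}[language=Python]
from mod_python import apache

# Minimal sketch of a mod_python request handler; the page content is
# a placeholder, not the real frontend served by the main module.
def handler(req):
    req.content_type = 'text/html'
    req.write('<html><body>crawler control center</body></html>')
    return apache.OK
\end{lstlisting}
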
\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends it to the crawler. The
module serves the frontend in a modular fashion. For example, the buttons and
colors can easily be edited by a non-programmer by just changing some values
in a text file. In this way, even when conventions change, the program can
still function without the intervention of a programmer to adapt the source.

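As a sketch of how such a text file could be read, the listing below uses the
standard \texttt{ConfigParser} module of Python~2. The file name and the
section and option names are hypothetical examples and not the actual
configuration keys used by the program.

\begin{lstlisting}[language=Python]
import ConfigParser  # named configparser in Python 3

# Hypothetical configuration file and keys, used only for illustration.
config = ConfigParser.ConfigParser()
config.read('frontend.cfg')

button_label = config.get('buttons', 'get_xml_label')
button_color = config.get('colors', 'button_background')
\end{lstlisting}
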
\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of Python scripts that, for example,
minimize the graphs, transform the user data into machine readable data,
export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and is technically
part of the libraries. What makes the crawler stand out is that it can also
run on its own. The crawler has to be run periodically by a server to actually
crawl the websites. The main module communicates with the crawler when it
needs XML data, when a new crawler is added or when data is edited. The
crawler also offers a command line interface that has the same functionality
as the web interface of the control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed, so that the crawler knows
which ones are already present in the database and which ones are new and it
does not have to process all the old entries when they appear in the feed
again. The RSS GUID could also have been used, but since it is an optional
value in the feed, not every feed uses it and it is therefore not reliable.
The crawler also has a function to export the database to XML format. The XML
format is specified in an XSD\cite{Xsd} file for minimal ambiguity.

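The listing below gives a minimal sketch of this deduplication idea. It
assumes that an entry is a dictionary with \texttt{title} and \texttt{summary}
fields; the field names, the hash function and the storage format are
illustrative assumptions and not necessarily those used by the crawler.

\begin{lstlisting}[language=Python]
import hashlib

def entry_hash(entry):
    """Hash the relevant fields of an entry into a stable key."""
    key = (entry['title'] + entry['summary']).encode('utf-8')
    return hashlib.sha1(key).hexdigest()

def merge_new_entries(database, feed_entries):
    """Store only the entries whose hash is not yet in the database."""
    for entry in feed_entries:
        digest = entry_hash(entry)
        if digest not in database:
            database[digest] = entry
    return database
\end{lstlisting}
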
\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be
accompanied by an XSD file that describes the format. An XSD file is in fact
just another XML file that describes the format of a class of XML files.
Because almost all programming languages have an XML parser built in, it is a
very versatile format that makes importing the data into the database very
easy. The most used languages also include XSD validation to check the
validity and completeness of XML files. This makes interfacing with the
database and possible future programs very easy. The XSD scheme used for this
program's output can be found in the appendices in Listing~\ref{scheme.xsd}.
The XML output can be queried via an HTTP interface that calls the crawler
backend to crunch the latest crawled data into XML.
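
As an illustration, the listing below sketches how the XML output could be
validated against the XSD scheme in Python, using the third party
\texttt{lxml} library. The file names are placeholders and \texttt{lxml} is
only one of several possible validators.

\begin{lstlisting}[language=Python]
from lxml import etree  # third party library, one possible validator

# Placeholder file names; the real scheme is given in the appendices.
schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('crawler_output.xml')

if schema.validate(document):
    print 'XML output conforms to the XSD scheme'
else:
    print schema.error_log
\end{lstlisting}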