\section{Requirements}
\subsection{Introduction}
As almost every plan for an application starts with a set of requirements, so
does this application. Requirements are a set of goals within different
categories that define what the application has to be able to do. They are
traditionally defined at the start of the project and are not expected to
change much. In the case of this project the requirements were a lot more
flexible, because there was only one person doing the programming and there
was a weekly meeting to discuss the matters and, most importantly, the
required changes. Because of this a lot of the initial requirements were
removed and some requirements were added in the process. The list below shows
the definitive requirements as well as the suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property, for example
efficiency, portability or compatibility. To be able to refer to them later we
give the requirements unique codes. For the definitive requirements a verbose
explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[F1:] Be able to crawl several source types.
		\begin{itemize}
			\item[F1a:] Fax/email.
			\item[F1b:] XML feeds.
			\item[F1c:] RSS feeds.
			\item[F1d:] Websites.
		\end{itemize}
	\item[F2:] Apply low level matching techniques on isolated data.
	\item[F3:] Insert the data in the database.
	\item[F4:] User interface to train crawlers that is usable by someone
		without a particular computer science background.
	\item[F5:] Control center for the crawlers.
	\item[F6:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the only requirement that is dropped completely, because it
lies outside of the time available for the project. The time available is
partly reduced because we chose to implement certain other requirements, such
as an interactive and intuitive user interface around the core of the pattern
extraction program. All other requirements either changed or stayed the same.
Below are all definitive requirements, each with the title on the first line
and a description underneath.
\begin{itemize}
	\item[F7:] Be able to crawl RSS feeds.

		This requirement is an adapted version of the compound
		requirements F1a-F1d. We stripped down from crawling four
		different sources to only one source because of the scope of
		the project. Most sources require an entirely different
		strategy and therefore we could not easily combine them. The
		full reason why we chose RSS feeds can be found in
		Section~\ref{sec:whyrss}.

	\item[F8:] Export the data to a strict XML feed.

		This requirement is an adapted version of requirement F3; this
		is also done to make the scope smaller. We chose not to
		interact with the database or the \textit{Temporum} directly.
		The application however is able to output XML data that is
		formatted following a strict XSD scheme so that it is easy to
		import the data into the database or \textit{Temporum}.
	\item[F9:] User interface to train crawlers that is usable by someone
		without a particular computer science background.

		This requirement is a combination of F4 and F5. At first the
		user interface for adding and training crawlers was done via a
		webinterface that was user friendly and usable by someone
		without a particular computer science background, as the
		requirement stated. However, in the first prototypes the
		control center that could test, edit and remove crawlers was a
		command line application and thus not very usable for the
		general audience. This combined requirement asks for a single
		control center that can do all previously described tasks with
		an interface that is usable without prior knowledge of
		computer science.
	\item[F10:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.

		This requirement was also present in the original requirements
		and has not changed. When the crawler fails to crawl a source,
		which can be due to any reason, a message is sent to the
		people using the program so that they can edit or remove the
		faulty crawler. Updating without the need of a programmer is
		essential in shortening the feedback loop explained in
		Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[N1:] Integrate in the original system.
	\item[N2:] Work in a modular fashion, thus be able to, in the future,
		extend the program.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
	\item[N2:] Work in a modular fashion, thus be able to, in the future,
		extend the program.

		Modularity is very important so that the components can be
		easily extended and new components can be added. Possible
		extensions are discussed in Section~\ref{sec:discuss}.
	\item[N3:] Operate standalone on a server.

		Non-functional requirement N1 is dropped because we want to
		keep the program as modular as possible. Via an XML interface
		we still have a very close connection with the database
		without having to maintain a direct connection.
\end{itemize}

\section{Application overview}
The workflow of the application can be divided into several components or
steps. The overview of the application is visible in Figure~\ref{appoverview}.
The nodes are applications or processing steps and the arrows denote
information flow or movement between nodes.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{appoverview.pdf}
	\strut\\
	\caption{Overview of the application}
	\label{appoverview}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend applications
and allows the user to interact with the backend. The frontend consists of a
basic graphical user interface which is shown in Figure~\ref{frontendfront}. As
the interface shows, there are three main components that the user can use.
There is also a button for downloading the XML. This \textit{Get xml} button is
a quick shortcut to make the backend generate XML and is located there only for
diagnostic purposes. In the standard workflow the XML button is not used;
instead, the server periodically calls the XML output option from the command
line interface of the backend to process it.

\begin{figure}[H]
	\includegraphics[width=\linewidth]{frontendfront.pdf}
	\caption{The landing page of the frontend}
	\label{frontendfront}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view the crawlers and remove the crawlers from the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from the dropdown menu, selecting the operation from the
other dropdown menu and pressing \textit{Submit}.

Removing the crawler will remove the crawler completely from the crawler
database and the crawler will be unrecoverable. Editing the crawler will open a
similar screen as when adding the crawler. The details about that screen will
be discussed in Section~\ref{addcrawler}. The only difference is that the
previously trained patterns are already made visible in the training interface
and can thus be adapted, for example to adjust the crawler to changes in the
source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and it
is the smartest part of the whole system as it includes the graph optimization
algorithm to recognize user specified patterns in the new data. First, the user
must fill in the static form that is visible at the top of the page. This form
contains general information about the venue together with some crawler
specific values such as the crawling frequency. After that the user can mark
certain pieces of text in the table as belonging to a certain category. Marking
text is as easy as selecting the text and pressing the corresponding button.
The text visible in the table is a stripped down version of the original RSS
feed's \texttt{title} and \texttt{summary} fields. When the text is marked it
will be highlighted in the same color as the color of the button text. The
entirety of the user interface with a few sample markings is shown in
Figure~\ref{crawlerpattern}. After marking the categories the user can preview
the data or submit. Previewing will run the crawler on the RSS feed in memory
and the user can revise the patterns if necessary. Submitting will send the
page to the backend to be processed. The internals of what happens after
submitting are explained in detail in Figure~\ref{appinternals} together with
the text.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{crawlerpattern.pdf}
	\caption{A view of the interface for specifying the pattern. Two %
		entries are already marked.}
	\label{crawlerpattern}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a very simple non-interactive component that
allows the user to verify whether a crawler functions properly without having
to access the database via the command line utilities. Via a dropdown menu the
user selects the crawler and when submit is pressed the backend generates a
results page that shows a small log of the crawler, a summary of the results
and, most importantly, the results themselves. In this way the user can see at
a glance whether the crawler functions properly. Humans are very fast at
detecting patterns and therefore the error checking goes very fast. Because the
log of the crawl operation is shown, this page can also be used for diagnostic
information about the backend's crawling system. The logging is quite detailed
and also shows possible exceptions and is therefore also usable for the
developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache HTTP server\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows the server to
handle HTTP requests with \textit{Python} code and this allows us to integrate
neatly with the \textit{Python} libraries. We chose \textit{Python} because of
the rich set of standard libraries and solid cross platform capabilities. We
chose specifically for \textit{Python} 2 because it is still the default
\textit{Python} version on all major operating systems and stays supported
until at least the year $2020$. This means that the program can function for at
least 5 full years. The application consists of a main \textit{Python} module
that is embedded in the HTTP server. Finally there are some libraries and there
is a standalone program that does the periodic crawling.
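
To illustrate how the main module is embedded in the Apache HTTP server via
\textit{mod\_python}, a minimal sketch of a request handler is shown below. The
\texttt{handler} entry point is the standard \textit{mod\_python} convention;
the response body shown here is only a placeholder and not the actual frontend
code.
\begin{verbatim}
# Minimal mod_python handler sketch; the response is a placeholder.
from mod_python import apache

def handler(req):
    # Every HTTP request for the configured location ends up here,
    # where the main module can dispatch it to the frontend logic.
    req.content_type = 'text/html'
    req.write('<html><body>frontend page goes here</body></html>')
    return apache.OK
\end{verbatim}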

\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends the patterns to the crawler.
The module serves the frontend in a modular fashion. For example, the buttons
and colors can be easily edited by a non-programmer by simply changing the
appropriate values in a text file. In this way, even when conventions change,
the program can still function without the intervention of a programmer who
needs to adapt the source.
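
As a sketch of how such a non-programmer configuration could look, the snippet
below reads button labels and colors from a plain text file using the standard
\textit{Python} 2 \texttt{ConfigParser} module. The file name, section and
option names are assumptions for illustration; the actual main module may store
these values differently.
\begin{verbatim}
# Sketch: load UI labels and colors from a plain text file (Python 2).
# The file name and option names are illustrative assumptions.
import ConfigParser

config = ConfigParser.ConfigParser()
config.read('frontend.conf')

# frontend.conf could contain, for example:
#   [buttons]
#   when_label = When
#   when_color = #ff0000
when_label = config.get('buttons', 'when_label')
when_color = config.get('buttons', 'when_color')
\end{verbatim}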

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of \textit{Python} scripts that, for
example, minimize the graphs, transform the user data into machine readable
data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and technically is
part of the libraries. The crawler stands out in that it can also run on its
own. The crawler has to be run periodically by a server to actually crawl the
sources. The main module communicates with the crawler when it is queried for
XML data, when a new crawler is added or when data is edited. The crawler also
offers a command line interface that has the same functionality as the web
interface of the control center.
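
A minimal sketch of what such a command line interface could look like, using
the standard \texttt{argparse} module, is given below. The subcommand names are
assumptions used for illustration; the real interface mirrors the functionality
of the web control center.
\begin{verbatim}
# Sketch of the standalone crawler CLI; subcommand names are illustrative.
import argparse

def main():
    parser = argparse.ArgumentParser(description='Standalone crawler')
    sub = parser.add_subparsers(dest='command')
    sub.add_parser('crawl', help='crawl all sources once')
    sub.add_parser('list', help='list the stored crawlers')
    xml = sub.add_parser('xml', help='export the database as XML')
    xml.add_argument('--output', default='output.xml')
    args = parser.parse_args()
    # Dispatch to the crawler library here; only echo the command for now.
    print args.command

if __name__ == '__main__':
    main()
\end{verbatim}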

The crawler saves all the data in a database. The database is a simple
dictionary where all the entries are hashed so that the crawler knows which
ones are already present in the database and which ones are new. In this way
the crawler does not have to process all the old entries when they appear in
the feed. The RSS GUID could also have been used, but since it is an optional
value in the feed not every feed uses it and it is therefore not reliable. The
crawler also has a function to export the database to XML format. The XML
output format is specified in an XSD\cite{Xsd} file for minimal ambiguity.
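
The deduplication can be sketched as follows: every entry is hashed and the
hash is used as the dictionary key, so entries that are already present in the
database are skipped. The exact fields and hash function that the crawler uses
are assumptions in this sketch.
\begin{verbatim}
# Sketch: skip RSS entries that were already crawled, keyed by a hash.
import hashlib

def entry_hash(title, summary):
    # Hash the fields assumed to identify an entry (title + summary).
    data = title.encode('utf-8') + summary.encode('utf-8')
    return hashlib.md5(data).hexdigest()

def add_new_entries(database, entries):
    for entry in entries:
        key = entry_hash(entry['title'], entry['summary'])
        if key not in database:
            # Only unseen entries are processed and stored.
            database[key] = entry
    return database
\end{verbatim}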

\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be accompanied
by an XSD file that describes the format. An XSD file is in fact just another
XML file that describes the format of XML files. Almost all programming
languages have an XML parser built in and therefore it is a very versatile
format that makes the eventual import into the database very easy. The most
commonly used languages also include XSD validation to check the validity and
completeness of XML files. This makes interfacing with the database and
possible future programs even easier. The XSD scheme used for this program's
output can be found in the appendices in Listing~\ref{scheme.xsd}. The XML
output can be queried via the HTTP interface that calls the crawler backend to
crunch the latest crawled data into XML. It can also be acquired directly from
the crawler's command line interface.
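
As an illustration of how a consumer of the XML output could validate it
against the XSD scheme, the sketch below uses the third-party \texttt{lxml}
library; the file names are placeholders.
\begin{verbatim}
# Sketch: validate the exported XML against the XSD scheme with lxml.
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')

if schema.validate(document):
    print 'XML output conforms to the scheme'
else:
    # The error log lists which elements violate the scheme.
    print schema.error_log
\end{verbatim}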