\section{Requirements}
\subsection{Introduction}
As with almost every plan for an application, this application starts with a
set of requirements. Requirements are a set of goals in different categories
that define what the application has to be able to do. Traditionally they are
defined at the start of the project and are not expected to change much. In
the case of this project the requirements were a lot more flexible because
there was only one person doing the programming and there was a weekly meeting
to discuss the state of affairs and, most importantly, the required changes.
Because of this a lot of the initial requirements were dropped and some
requirements were added in the process. The list below shows the definitive
requirements as well as the suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property, for example
efficiency, portability or compatibility. To be able to refer to them later,
the requirements are given unique codes. For the definitive requirements a
verbose explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[I1:] The system should be able to crawl several source types.
		\begin{itemize}
			\item[I1a:] Fax/email.
			\item[I1b:] XML feeds.
			\item[I1c:] RSS feeds.
			\item[I1d:] Websites.
		\end{itemize}
	\item[I2:] Apply low level matching techniques on isolated data.
	\item[I3:] Insert data in the database.
	\item[I4:] The system should have a user interface to train crawlers
		that is usable by someone without a particular computer science
		background.
	\item[I5:] The system should be able to report to the employee when a
		source has been changed too much for successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement I2 is the only requirement that was dropped completely, due to
time constraints. The time limitation is partly because we chose to implement
certain other requirements, such as an interactive, intuitive user interface
around the core of the pattern extraction program. All definitive requirements
are listed below.
\begin{itemize}
	\item[F1:] The system should be able to crawl RSS feeds.

		This requirement is an adapted version of the compound
		requirements I1a-I1d. We limited the source types to crawl to
		strict RSS because of the time constraints of the project. Most
		sources require an entirely different strategy and therefore we
		could not easily combine them. An explanation of why we chose
		RSS feeds can be found in Section~\ref{sec:whyrss}.

	\item[F2:] Export the data to a strict XML feed.

		This requirement is an adapted version of requirement I3; this
		was also done to limit the scope. We chose not to interact
		directly with the database or the \textit{Temporum}. The
		application is, however, able to output XML data that is
		formatted following a strict XSD scheme, so that it is easy to
		import the data into the database or the \textit{Temporum} in
		an indirect way.
	\item[F3:] The system should have a user interface to create crawlers
		that is usable by someone without a particular computer science
		background.

		This requirement is formed from I4. Initially the user
		interface for adding and training crawlers was a web interface
		that was user friendly and usable by someone without a
		particular computer science background, as the requirement
		stated. However, in the first prototypes the control center
		that could test, edit and remove crawlers was a command line
		application and thus not very usable for a general audience.
		This combined requirement asks for a single control center that
		can do all previously described tasks with an interface that is
		usable without prior knowledge of computer science.
	\item[F4:] Report to the user or maintainer when a source has been
		changed too much for successful crawling.

		This requirement was also present in the original requirements
		and has not changed. When the crawler fails to crawl a source,
		which can happen for any reason, a message is sent to the
		people using the program so that they can edit or remove the
		faulty crawler. Being able to update crawlers without the need
		of a programmer is essential in shortening the feedback loop
		explained in Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[O1:] Integrate in the existing system used by Hyperleap.
	\item[O2:] The system should work in a modular fashion, so that the
		program can be extended in the future.
\end{itemize}

\subsubsection{Definitive non-functional requirements}
\begin{itemize}
	\item[N1:] Work in a modular fashion, so that the program can be
		extended in the future.

		The modularity is very important so that the components can be
		easily extended and new components can be added. Possible
		extensions are discussed in Section~\ref{sec:discuss}.
	\item[N2:] Operate standalone on a server.

		Non-functional requirement O1 was dropped because we want to
		keep the program as modular as possible; via an XML interface
		we still have a very close connection with the database without
		having to maintain a direct connection. The downside of an
		indirect connection instead of a direct connection is that the
		specification is much more rigid: if the specification changes,
		the backend program has to change as well.
\end{itemize}

\section{Application overview}
The workflow of the application can be divided into several components or
steps. The overview of the application is visible in Figure~\ref{appoverview}.
The nodes are applications or processing steps and the arrows denote
information flow or movement between nodes.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{appoverview.pdf}
	\caption{Overview of the application\label{appoverview}}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend system and
allows the user to interact with the backend. The frontend consists of a basic
graphical user interface which is shown in Figure~\ref{frontendfront}. As the
interface shows, there are three main components that the user can use. There
is also a button for downloading the XML\@. The \textit{Get xml} button is a
quick shortcut that makes the backend generate XML\@. This button is located
there only for diagnostic purposes and is not used in the standard workflow.
In the standard workflow the server periodically calls the XML output option
of the backend's command line interface to process the data.

\begin{figure}[H]
	\includegraphics[width=\linewidth]{frontendfront.pdf}
	\caption{The landing page of the frontend\label{frontendfront}}
\end{figure}

\subsubsection{Repair/Remove crawler}
This component lets the user view the crawlers and remove crawlers from the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from one dropdown menu, selecting the operation from the
other dropdown menu and pressing \textit{Submit}.

Removing a crawler removes it completely from the crawler database and the
crawler is unrecoverable. Editing a crawler opens a screen similar to the one
for adding a crawler. The details of that screen are discussed in
Section~\ref{addcrawler}. The only difference is that the previously trained
patterns are already visible in the training interface and can thus be
adapted, for example to accommodate changes in the source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the intelligent part of the system, since it includes the graph
optimization algorithm that recognizes user specified patterns in new data.
First, the user must fill in the static form that is visible at the top of the
page. This form contains, for example, general information about the venue
together with some crawler specific values such as the crawling frequency.
After that the user can mark certain points in the table as belonging to a
category. Marking text is as easy as selecting the text and pressing the
corresponding button. The text visible in the table is a stripped down version
of the \texttt{title} and \texttt{summary} fields of the original RSS feed.
When the text is marked it is highlighted in the same color as the color of
the button text. The entirety of the user interface with a few sample markings
is shown in Figure~\ref{crawlerpattern}. After marking the categories the user
can preview the data or submit. Previewing runs the crawler on the RSS feed in
memory so that the user can revise the patterns if necessary. Submitting sends
the page to the backend to be processed. What happens internally after
submitting is explained in detail in Figure~\ref{appinternals} together with
the accompanying text.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{crawlerpattern.pdf}
	\caption{A view of the interface for specifying the pattern. Two %
		entries are already marked.\label{crawlerpattern}}
\end{figure}
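The table contents originate from the \texttt{title} and \texttt{summary}
fields of the RSS entries. As a rough illustration of that raw data, the
sketch below fetches a feed with the third-party \texttt{feedparser} library
and prints those two fields for every entry. The library and the URL are only
assumptions for this example and the actual crawler does not necessarily use
them; the fragment is merely meant to show the kind of data involved.
\begin{verbatim}
# Illustrative sketch only, not the actual backend code: fetch an RSS
# feed and print the fields that the training interface presents.
import feedparser

def fetch_entries(feed_url):
    """Return (title, summary) pairs for every entry in the feed."""
    feed = feedparser.parse(feed_url)
    return [(entry.get('title', ''), entry.get('summary', ''))
            for entry in feed.entries]

if __name__ == '__main__':
    # Placeholder URL; any RSS feed would do.
    for title, summary in fetch_entries('http://example.com/events.rss'):
        print title, '|', summary
\end{verbatim}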

\subsubsection{Test crawler}
The test crawler component is a very simple non-interactive component that
allows the user to verify whether a crawler functions properly without having
to access the database via the command line utilities. Via a dropdown menu the
user selects the crawler and when \textit{Submit} is pressed the backend
generates a results page that shows a small log of the crawl, a summary of the
results and, most importantly, the results themselves. In this way the user
can see at a glance whether the crawler functions properly. Humans are very
fast at detecting patterns and therefore the error checking goes very quickly.
Because the log of the crawl operation is shown, this page can also be used
for diagnostic information about the backend's crawling system. The logging is
in depth and also shows possible exceptions, and is therefore also usable by
the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module is embedded in an Apache
HTTP server\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows \textit{Python}
code to handle HTTP requests, which lets us integrate neatly with the
\textit{Python} libraries. We chose \textit{Python} because of its rich set of
standard libraries and solid cross platform capabilities. We chose
\textit{Python} 2 specifically because it is still the default \textit{Python}
version on all major operating systems and remains supported until at least
the year $2020$. This means that the program can function for at least five
full years. Besides the main module that is embedded in the HTTP server, there
are some libraries and there is a standalone program that does the periodic
crawling.
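To sketch how \textit{mod\_python} hands HTTP requests to \textit{Python}
code, a minimal handler is shown below. The handler and its output are
hypothetical and much simpler than the actual main module; in the Apache
configuration such a handler is registered with a \texttt{PythonHandler}
directive.
\begin{verbatim}
# Minimal, hypothetical mod_python handler; it only shows how Apache
# passes a request object to Python code.
from mod_python import apache

def handler(req):
    # req is the Apache request object supplied by mod_python.
    req.content_type = 'text/html'
    req.write('<html><body>Crawler control center</body></html>')
    return apache.OK
\end{verbatim}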

\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends the patterns to the crawler.
The module serves the frontend in a modular fashion. For example, the buttons
and colors can be easily edited by a non-programmer by just changing the
appropriate values in a text file. In this way, even when conventions change,
the program can still function without the intervention of a programmer who
needs to adapt the source.
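As an example of such a text file, the hypothetical configuration below maps
category buttons to labels and colors and is read with the standard
\texttt{ConfigParser} module. The file name, sections and keys are made up for
the illustration; the real configuration of the program may look different.
\begin{verbatim}
# Hypothetical contents of frontend.cfg:
#
#   [buttons]
#   date = Date
#   location = Location
#
#   [colors]
#   date = #ff9900
#   location = #3366ff
#
# Reading the file with the Python 2 standard library:
import ConfigParser

config = ConfigParser.ConfigParser()
config.read('frontend.cfg')
for name, label in config.items('buttons'):
    print name, label, config.get('colors', name)
\end{verbatim}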

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of \textit{Python} scripts that, for
example, minimize the graphs, transform the user data into machine readable
data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program, also written in \textit{Python}, that is used by the
main module and technically is part of the libraries. The property in which
the crawler stands out is the fact that it can also run on its own. The
crawler has to be run periodically by a server to literally crawl the
websites. The main module communicates with the crawler when it is queried for
XML data, when a new crawler is added or when data is edited. The crawler also
offers a command line interface that has the same functionality as the web
interface of the control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed, so that the crawler knows
which ones are already present in the database and which ones are new. In this
way the crawler does not have to process the old entries when they appear in
the feed again. The RSS GUID could also have been used, but since it is an
optional value in the feed, not every feed uses it and it is therefore not
reliable. The crawler also has a function to export the database to XML
format. The XML output format is specified in an XSD\cite{Xsd} file for
minimal ambiguity.
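The duplicate detection described above can be summarized with the following
sketch. It is not the actual crawler code; it only assumes that an entry is
identified by a hash of its \texttt{title} and \texttt{summary} fields and
that the database behaves like a dictionary keyed on that hash.
\begin{verbatim}
# Sketch of hash-based duplicate detection, not the actual crawler code.
import hashlib

def entry_key(entry):
    """Identify an entry by hashing its title and summary."""
    digest = hashlib.sha1()
    digest.update(entry['title'].encode('utf-8'))
    digest.update(entry['summary'].encode('utf-8'))
    return digest.hexdigest()

def process_feed(entries, database):
    """Store only the entries that are not yet in the database (a dict)."""
    for entry in entries:
        key = entry_key(entry)
        if key in database:
            continue           # already crawled earlier, skip it
        database[key] = entry  # new entry: remember it for processing
\end{verbatim}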

\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be accompanied
by an XSD file that describes the format. An XSD file is in fact just another
XML file that describes the format of a class of XML files. Almost all
programming languages have an XML parser built in and therefore XML is a very
versatile format that makes the eventual import into the database very easy.
The most used languages also include XSD validation to detect errors in the
validity and completeness of XML files. This makes interfacing with the
database and possible future programs even easier. The XSD scheme used for the
output of this program can be found in the appendices in
Algorithm~\ref{scheme.xsd}. The XML output can be queried via the HTTP
interface, which calls the crawler backend to crunch the latest crawled data
into XML\@. It can also be acquired directly from the crawler's command line
interface.
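As an example of how a consumer of the output could use the XSD scheme, the
sketch below validates a generated XML file against \texttt{scheme.xsd} with
the third-party \texttt{lxml} library. The file names are illustrative and
this validation step is not part of the program itself.
\begin{verbatim}
# Illustrative only: check the exported XML against the XSD scheme
# before importing it further, using the third-party lxml library.
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')

if schema.validate(document):
    print 'output.xml conforms to scheme.xsd'
else:
    # error_log lists every violation found during validation
    print schema.error_log
\end{verbatim}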