\section{Requirements}
\subsection{Introduction}
Like almost every plan for an application, this project starts with a set of
requirements. Requirements are goals in different categories that define what
the application has to be able to do. Traditionally they are defined at the
start of the project and are not expected to change much. In the case of this
project the requirements were much more flexible, because there was only one
person doing the programming and there was a weekly meeting to discuss
progress and, most importantly, the required changes. Because of this several
initial requirements were dropped and some requirements were added along the
way. The list below shows the definitive requirements as well as the
suspended requirements.

There are two types of requirements: functional and non-functional
requirements. Functional requirements describe a certain function in the
technical sense. Non-functional requirements describe a property, for example
efficiency, portability or compatibility. To be able to refer to them later
we give the requirements unique codes. For the definitive requirements a
verbose explanation is also provided.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
\item[I1:] The system should be able to crawl several source types.
\begin{itemize}
\item[I1a:] Fax/email.
\item[I1b:] XML feeds.
\item[I1c:] RSS feeds.
\item[I1d:] Websites.
\end{itemize}
\item[I2:] Apply low level matching techniques on isolated data.
\item[I3:] Insert data into the database.
\item[I4:] The system should have a user interface to train crawlers that is
usable by someone without a particular computer science background.
\item[I5:] The system should be able to report to the user or
maintainer when a source has been changed too much for
successful crawling.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement I2 is the only requirement that is dropped completely, due to
time constraints. This is partly because we chose to implement certain other
requirements, such as an interactive and intuitive user interface, around the
core of the pattern extraction program. Below are all definitive
requirements.
\begin{itemize}
\item[F1:] The system should be able to crawl RSS feeds.

This requirement is an adapted version of the compound
requirements I1a--I1d. We limited the source types to crawl to
strict RSS because of the time constraints of the project. Most
sources require an entirely different strategy and therefore we
could not easily combine them. An explanation of why we chose RSS
feeds can be found in Section~\ref{sec:whyrss}.

\item[F2:] Export the data to a strict XML feed.

This requirement is an adapted version of requirement I3; this
is also done to limit the scope. We chose not to interact
directly with the database or the \textit{Temporum}. The
application, however, is able to output XML data that is
formatted following a strict XSD scheme so that it is easy to
import the data into the database or \textit{Temporum} in an
indirect way.
\item[F3:] The system should have a user interface to create crawlers
that is usable by someone without a particular computer science
background.

This requirement is formed from I4. Initially the user
interface for adding and training crawlers was a
web interface that was user friendly and usable by someone
without a particular computer science background, as the
requirement stated. However, in the first prototypes the control
center that could test, edit and remove crawlers was a command
line application and thus not very usable for the general
audience. This combined requirement asks for a single control
center that can do all previously described tasks with an
interface that is usable without prior knowledge of computer
science.
\item[F4:] Report to the user or maintainer when a source has been
changed too much for successful crawling.

This requirement was also present in the original requirements
and has not changed. When the crawler fails to crawl a source,
for whatever reason, a message is sent to the people
using the program so that they can edit or remove the faulty
crawler. Updating without the need of a programmer is essential
in shortening the feedback loop explained in
Figure~\ref{feedbackloop}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
\item[O1:] Integrate with the existing system used by Hyperleap.
\item[O2:] The system should work in a modular fashion, so that the
program can be extended in the future.
\end{itemize}

\subsubsection{Definitive non-functional requirements}
\begin{itemize}
\item[N1:] Work in a modular fashion, so that the program can be
extended in the future.

The modularity is very important so that the components can
easily be extended and new components can be added. Possible
extensions are discussed in Section~\ref{sec:discuss}.
\item[N2:] Operate standalone on a server.

Non-functional requirement O1 is dropped because we want to
keep the program as modular as possible. Via an XML
interface we still have a very close connection with the
database without having to maintain a direct connection. The
downside of an indirect connection instead of a direct
connection is that the specification is much more rigid: if the
specification changes, the backend program has to change as
well.
\end{itemize}

\section{Application overview}
The workflow of the application can be divided into several components or
steps. An overview of the application is shown in Figure~\ref{appoverview}.
The nodes are applications or processing steps and the arrows denote
information flow or movement between nodes.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{appoverview.pdf}
\caption{Overview of the application}
\label{appoverview}
\end{figure}

\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface connected to the backend system that allows
the user to interact with the backend. The frontend consists of a basic
graphical user interface, which is shown in Figure~\ref{frontendfront}. As
the interface shows, there are three main components that the user can use.
There is also a button for downloading the XML. The \textit{Get xml} button
is a quick shortcut that makes the backend generate XML and it is located
there for diagnostic purposes only. In the standard workflow the XML button
is not used; instead, the server periodically calls the XML output option
from the command line interface of the backend and processes the result.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{frontendfront.pdf}
\caption{The landing page of the frontend}
\label{frontendfront}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view, edit and remove the crawlers in the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from one dropdown menu, selecting the operation from
the other dropdown menu and pressing \textit{Submit}.

Removing a crawler will remove it completely from the crawler database and
the crawler will be unrecoverable. Editing a crawler will open a screen
similar to the one for adding a crawler. The details of that screen are
discussed in Section~\ref{addcrawler}. The only difference is that the
previously trained patterns are already visible in the training interface and
can thus be adapted, for example to adjust the crawler to changes in the
source.

\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and
it is the intelligent part of the system, since it includes the graph
optimization algorithm to recognize user specified patterns in the new data.
First, the user must fill in the static form that is visible at the top of
the page. This form contains general information about the venue together
with some crawler specific values such as the crawling frequency. After that
the user can mark certain pieces of text in the table as belonging to a
category. Marking text is as easy as selecting the text and pressing the
corresponding button. The text visible in the table is a stripped down
version of the original RSS feed's \texttt{title} and \texttt{summary}
fields. When the text is marked it will be highlighted in the same color as
the button text. The entirety of the user interface with a few sample
markings is shown in Figure~\ref{crawlerpattern}. After marking the
categories the user can preview the data or submit. Previewing will run the
crawler on the RSS feed in memory so that the user can revise the patterns if
necessary. Submitting will send the page to the backend to be processed. The
internals of what happens after submitting are explained in detail in
Figure~\ref{appinternals} and the accompanying text.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{crawlerpattern.pdf}
\caption{A view of the interface for specifying the pattern. Two %
entries are already marked.}
\label{crawlerpattern}
\end{figure}

\subsubsection{Test crawler}
The test crawler component is a simple non-interactive component that allows
the user to verify whether a crawler functions properly without having to
access the database via the command line utilities. Via a dropdown menu the
user selects the crawler and when submit is pressed the backend generates a
results page that shows a small log of the crawl, a summary of the results
and, most importantly, the results themselves. In this way the user can see
at a glance whether the crawler functions properly. Humans are very fast at
detecting patterns and therefore the error checking is quick. Because the log
of the crawl operation is shown, this page can also be used to obtain
diagnostic information about the backend's crawling system. The logging is
quite in-depth and also shows possible exceptions, which makes it useful for
the developers to diagnose problems.

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache HTTP server\cite{apache} via the \textit{mod\_python} Apache
module\cite{Modpython}. The \textit{mod\_python} module allows Python code to
handle HTTP requests, which lets us integrate neatly with the \textit{Python}
libraries. We chose \textit{Python} because of its rich set of standard
libraries and solid cross platform capabilities. We specifically chose
\textit{Python} 2 because it is still the default \textit{Python} version on
all major operating systems and stays supported until at least the year
$2020$. This means that the program can function for at least 5 full years.
The application thus consists of a main \textit{Python} module embedded in
the HTTP server, a set of libraries and a standalone program that does the
periodic crawling.

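To illustrate how the main module is reachable over HTTP, the sketch below
shows a minimal \textit{mod\_python} request handler. The response body is a
hypothetical placeholder; the real main module dispatches the request to the
frontend and library code.
\begin{verbatim}
# Minimal mod_python handler sketch (Python 2). The response body is
# a placeholder; the actual main module builds the frontend pages.
from mod_python import apache

def handler(req):
    # mod_python calls this function for every request that Apache
    # routes to the module.
    req.content_type = 'text/html'
    req.write('<html><body>crawler control center</body></html>')
    return apache.OK
\end{verbatim}
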
\subsubsection{Main module}
The main module is the program that deals with the requests, controls the
frontend, converts the data to patterns and sends the patterns to the
crawler. The module serves the frontend in a modular fashion. For example,
the buttons and colors can easily be edited by a non-programmer by changing
the appropriate values in a text file. In this way, even when conventions
change, the program can still function without the intervention of a
programmer who needs to adapt the source.

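As a sketch of this idea, the interface values could be read from a plain
configuration file with the standard \textit{Python} 2 \texttt{ConfigParser}
module; the file name and keys below are hypothetical.
\begin{verbatim}
# Read button labels and colors from a text file so that a
# non-programmer can change them without touching the source.
import ConfigParser  # Python 2 standard library

config = ConfigParser.ConfigParser()
config.read('interface.cfg')

when_label = config.get('buttons', 'when')  # e.g. 'When'
when_color = config.get('colors', 'when')   # e.g. '#ffff00'
\end{verbatim}
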
\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of \textit{Python} scripts that,
for example, minimize the graphs, transform the user data into machine
readable data, export the crawled data to XML and much more.

\subsubsection{Standalone crawler}
The crawler is a program, also written in \textit{Python}, that is used by
the main module and technically is part of the libraries. What sets the
crawler apart is that it can also run on its own. The crawler has to be run
periodically by a server to actually crawl the sources. The main module
communicates with the crawler when it is queried for XML data, when a new
crawler is added or when data is edited. The crawler also offers a command
line interface that has the same functionality as the web interface of the
control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed so that the crawler knows
which ones are already present in the database and which ones are new. In
this way the crawler does not have to process the old entries again when they
reappear in the feed. The RSS GUID could also have been used, but since it is
an optional value not every feed provides it and it is therefore not reliable.
The crawler also has a function to export the database to XML format. The XML
output format is specified in an XSD\cite{Xsd} file for minimal ambiguity.

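The sketch below illustrates the deduplication idea: each entry is reduced to
a digest of its \texttt{title} and \texttt{summary}, and only entries whose
digest is not yet in the store are processed. The storage layout shown (a
plain dictionary) is an assumption made for the example.
\begin{verbatim}
# Deduplicate feed entries by hashing their content (sketch).
import hashlib

def entry_hash(title, summary):
    # A stable digest of the entry content identifies it across runs.
    data = (title + summary).encode('utf-8')
    return hashlib.md5(data).hexdigest()

def new_entries(entries, store):
    # Yield only the (title, summary) pairs not yet in the store and
    # record them so that the next run skips them.
    for title, summary in entries:
        key = entry_hash(title, summary)
        if key not in store:
            store[key] = (title, summary)
            yield title, summary
\end{verbatim}
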
\subsubsection{XML \& XSD}
XML is a file format that can describe data structures. XML can be
accompanied by an XSD file that describes the format. An XSD file is in fact
just another XML file that describes the format of a class of XML files.
Almost all programming languages have an XML parser built in, which makes XML
a very versatile format and makes the eventual import into the database very
easy. The most used languages also include XSD validation to detect errors in
the validity and completeness of XML files. This makes interfacing with the
database and possible future programs even easier. The XSD scheme used for
this program's output can be found in the appendices in
Listing~\ref{scheme.xsd}. The XML output can be queried via the HTTP
interface that calls the crawler backend to crunch the latest crawled data
into XML. It can also be acquired directly from the crawler's command line
interface.
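As an example of such a validity check, the third-party \texttt{lxml} library
can validate the exported XML against the XSD scheme; the file names below
are hypothetical.
\begin{verbatim}
# Validate the exported XML against the XSD scheme (sketch, Python 2
# with the third-party lxml library).
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')

if schema.validate(document):
    print 'output conforms to the schema'
else:
    print schema.error_log
\end{verbatim}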