\section{Requirements}
\subsection{Introduction}
As with almost every other computer program, the development of this
application starts with a set of requirements. Requirements are a set of goals
within different categories that define what the application has to be able to
do. Traditionally they are defined at the start of the project and are not
expected to change much. In the case of this project the requirements were a
lot more flexible, because there was only one person doing the programming and
there was a weekly meeting to discuss the state of affairs and, most
importantly, the required changes. Because of this a number of initial
requirements were removed and some requirements were added in the process. The
list below shows the definitive requirements as well as the suspended
requirements.

The requirements fall into two categories: functional and non-functional
requirements. Functional requirements describe a certain function, whereas
non-functional requirements describe a certain property such as efficiency or
compatibility. To be able to refer to them, we give the requirements unique
codes. In the list of active requirements we also specify the reason for the
choice.

\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
\item[F1:] Be able to crawl several source types.
\begin{itemize}
\item[F1a:] Fax/email.
\item[F1b:] XML feeds.
\item[F1c:] RSS feeds.
\item[F1d:] Websites.
\end{itemize}
\item[F2:] Apply low-level matching techniques on isolated data.
\item[F3:] Insert the data in the database.
\item[F4:] The user interface to train the crawlers must be usable by people
without a computer science background.
\item[F5:] There must be a control center for the crawlers.
\end{itemize}

\subsubsection{Definitive functional requirements}
Requirement F2 is the only requirement that was dropped completely, because it
turned out to lie outside the scope of the project. This is mainly because we
chose to build an interactive, intuitive user interface around the core of the
pattern extraction program. All other requirements either changed or stayed
the same. Below, all definitive requirements are listed with the title on the
first line and a description underneath.
\begin{itemize}
\item[F6:] Be able to crawl RSS feeds only.

This requirement is an adapted version of the compound requirements F1a-F1d.
We stripped down from crawling four different source types to only one source
because of the scope of the project; most sources require an entirely
different strategy. The full reason why we chose RSS feeds can be found in
Section~\ref{sec:whyrss}.

\item[F7:] Export the data to a strict XML feed.

This requirement is an adapted version of requirement F3 and was changed to
keep the scope small. We chose not to interact with the database or the
\textit{Temporum} directly. Instead, the application has to be able to output
XML data that is formatted following a strict XSD scheme, so that it is easy
to import the data into the database or the \textit{Temporum}.
\item[F8:] A control center interface that is usable by people without a
computer science background.

This requirement is a combination of F4 and F5. At first the user interface
for adding and training crawlers was a web interface that was user friendly
and usable by people without a computer science background, as the requirement
stated. However, in the first prototypes the control center that could test,
edit and remove crawlers was a command line application and thus not very
usable for a general audience. This combined requirement asks for a single
control center that can perform all previously described tasks with an
interface that is usable by almost everyone.
\item[F9:] Report to the user or maintainer when a source has changed too
much for successful crawling.

This requirement was also present in the original requirements and has not
changed. When the crawler fails to crawl a source, for whatever reason, a
message is sent to the people using the program so that they can edit or
remove the faulty crawler. This is a crucial component because it allows a
person without a computer science background to perform this task, which is
essential in shortening the feedback loop explained in Figure~\ref{fig:1.1.2}.
\end{itemize}

\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
\item[N1:] Integrate in the original system.
\item[N2:] Work in a modular fashion, so that the program can be extended in
the future.
\end{itemize}

\subsubsection{Active non-functional requirements}
\begin{itemize}
\item[N2:] Work in a modular fashion, so that the program can be extended in
the future.

Modularity is very important so that the components can easily be extended
and new components can be added. Possible extensions are discussed in
Section~\ref{sec:discuss}.
\item[N3:] Operate standalone on a server.

Non-functional requirement N1 was dropped because we want to keep the program
as modular as possible. Via the XML interface we still have a very stable
connection with the database, but we avoid getting entangled in the software
that manages the database.
\end{itemize}

\section{Design}
\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface to the backend application that allows the
user to interact with the backend, for example by adding crawlers. The
frontend consists of a basic graphical user interface that is shown in
Figure~\ref{frontendfront}. As the interface shows, there are three main
components that the user can use. There is also a button for downloading the
XML. The XML button is a quick shortcut to make the backend generate the XML
output, but it is only located there for diagnostic purposes and is not used
in the standard workflow. In the standard workflow the server periodically
requests the XML output from the backend and processes it.
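
As an illustration of this periodic call, the fragment below fetches the XML
output over HTTP with the Python~2 standard library; the URL is only an
example and not the actual location of the XML output.
\begin{verbatim}
# Illustration only: periodically fetching the XML output of the
# backend. The URL is an example, not the actual address.
import urllib2

def fetch_xml(url='http://localhost/crawler/xml'):
    response = urllib2.urlopen(url)
    return response.read()  # raw XML, ready for further processing
\end{verbatim}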

\begin{figure}[H]
\caption{The landing page of the frontend}
\label{frontendfront}
\includegraphics[scale=0.75,natheight=160,natwidth=657]{frontendfront.png}
\end{figure}

\subsubsection{Edit/Remove crawler}
This component lets the user view the crawlers and remove crawlers from the
database. Removing a crawler is as simple as selecting it from the dropdown
list and pressing the remove button. Editing a crawler is done in the same
fashion, but by pressing the edit button. Editing a crawler is basically the
same as adding a new crawler, except that the previous pattern is already
visible and can be adapted, for example when the structure of the source has
changed.

\subsubsection{Add new crawler}
\subsubsection{Test crawler}

\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries, all written in
\textit{Python}\cite{Python}. The main module can be, and is, embedded in an
Apache webserver\footnote{\url{https://httpd.apache.org/}} via the
\textit{mod\_python} Apache module\cite{Modpython}. The \textit{mod\_python}
module allows the webserver to execute Python code within the webserver
process. We chose Python because of its rich set of standard libraries and
solid cross-platform capabilities. We chose Python~2 because it is still the
default Python version on all major operating systems and stays supported
until at least the year 2020, meaning that the program can function safely for
at least five full years. The application consists of a main Python module
that is embedded in the webserver, a number of libraries and a standalone
program that does the periodic crawling.
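
To give an impression of how \textit{mod\_python} embeds Python code in the
webserver, the fragment below shows a minimal request handler in the generic
\textit{mod\_python} style; it is only an illustration and not the actual main
module of this program.
\begin{verbatim}
# Illustration only: a minimal mod_python request handler. The real
# main module serves the frontend and talks to the crawler instead of
# returning a fixed page.
from mod_python import apache

def handler(req):
    req.content_type = 'text/html'
    req.write('<html><body>crawler control center</body></html>')
    return apache.OK
\end{verbatim}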

\subsubsection{Main module}
The main module is the program that handles the requests, controls the
frontend, converts the data to patterns and sends them to the crawler. The
module serves the frontend in a modular fashion. For example, the buttons and
colors can easily be edited by a non-programmer by just changing some values
in a text file. In this way, even when conventions change, the program can
still function without a programmer having to adapt the source code.
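
The fragment below sketches how such a text file could be read with the
standard \textit{ConfigParser} library; the file name, sections and option
names are only examples and not the actual configuration of the program.
\begin{verbatim}
# Illustration only: reading interface settings from a plain text
# file so that a non-programmer can change them. The file name and
# option names are examples.
import ConfigParser  # Python 2 standard library

config = ConfigParser.RawConfigParser()
config.read('frontend.cfg')

button_label = config.get('buttons', 'add_crawler_label')
background = config.get('colours', 'background')
\end{verbatim}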

\subsubsection{Libraries}
The libraries are called by the main program and take care of all the hard
work. Basically the libraries are a group of Python scripts that, for example,
minimize the graphs, transform the user data into machine readable data,
export the crawled data to XML and much more.
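
As an illustration of the export step, the fragment below builds an XML
document from a few crawled entries with the standard
\textit{xml.etree.ElementTree} library; the element names are only examples,
the real output has to follow the XSD scheme in Listing~\ref{scheme.xsd}.
\begin{verbatim}
# Illustration only: exporting crawled entries to XML with the
# standard library. The element names are examples; the real output
# follows the XSD scheme in the appendix.
import xml.etree.ElementTree as ET

def export(entries, filename='output.xml'):
    root = ET.Element('events')
    for entry in entries:
        event = ET.SubElement(root, 'event')
        ET.SubElement(event, 'title').text = entry['title']
        ET.SubElement(event, 'date').text = entry['date']
    ET.ElementTree(root).write(filename, encoding='utf-8')
\end{verbatim}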

\subsubsection{Standalone crawler}
The crawler is a program that is used by the main module and is technically
part of the libraries. What makes the crawler stand out is that it can also
run on its own. The crawler has to be run periodically by a server to actually
crawl the websites. The main module communicates with the crawler when it
needs XML data, when a new crawler is added or when data is edited. The
crawler also offers a command line interface that has the same functionality
as the web interface of the control center.

The crawler saves all the data in a database. The database is a simple
dictionary in which all the entries are hashed, so that the crawler knows
which ones are already present in the database and which ones are new, and
therefore does not have to process the old entries when they appear in the
feed again. The GUID from the RSS specification could also have been used,
but since it is an optional value not every feed provides it, and it is
therefore not reliable. The crawler also has a function to export the
database to XML format. The XML format is specified in an XSD\cite{Xsd} file
for minimal ambiguity.
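
The fragment below sketches this hashing approach; it assumes the third-party
\textit{feedparser} library and uses a pickled dictionary as the database,
which are choices made for the illustration and not necessarily those of the
actual crawler.
\begin{verbatim}
# Illustration only: detecting new feed entries by hashing them.
import hashlib
import pickle
import feedparser  # third-party RSS parser, assumed available

def new_entries(feed_url, dbfile='entries.db'):
    try:
        with open(dbfile, 'rb') as f:
            seen = pickle.load(f)
    except IOError:
        seen = {}
    fresh = []
    for entry in feedparser.parse(feed_url).entries:
        # The GUID is optional in RSS, so hash the title and summary
        # instead of relying on it.
        text = entry.get('title', '') + entry.get('summary', '')
        key = hashlib.sha1(text.encode('utf-8')).hexdigest()
        if key not in seen:
            seen[key] = entry.get('link', '')
            fresh.append(entry)
    with open(dbfile, 'wb') as f:
        pickle.dump(seen, f)
    return fresh
\end{verbatim}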

An XSD file is a file that precisely describes the fields the XML file uses.
As almost every programming language contains an XML library, most languages
also contain an XSD library that allows the XML library to validate files
against the scheme, so that the programmer knows exactly which XML fields to
expect. The XSD file for this program can be found in the appendices in
Listing~\ref{scheme.xsd}.
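
To show what such a validation step looks like, the fragment below checks an
exported XML file against the XSD scheme using the third-party \textit{lxml}
library; the library choice and the output file name are only assumptions for
this example.
\begin{verbatim}
# Illustration only: validating exported XML against the XSD scheme,
# here with the third-party lxml library.
from lxml import etree

schema = etree.XMLSchema(etree.parse('scheme.xsd'))
document = etree.parse('output.xml')  # example file name

if schema.validate(document):
    print('output.xml conforms to scheme.xsd')
else:
    print(schema.error_log)
\end{verbatim}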