\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
	\item[I1:] Be able to crawl several source types.
\begin{itemize}
		\item[I1a:] Fax/email.
		\item[I1b:] XML feeds.
		\item[I1c:] RSS feeds.
		\item[I1d:] Websites.
\end{itemize}
	\item[I2:] Apply low level matching techniques on isolated data.
	\item[I3:] Insert the data in the database.
	\item[I4:] User interface to train crawlers that is usable by someone
		without a particular computer science background.
	\item[I5:] The system should be able to report to the user or
		maintainer when a source has been changed too much for
		successful crawling.
\end{itemize}
\subsubsection{Definitive functional requirements}
Requirement I2 is the only requirement that is dropped completely, due to time
constraints. The time limitation is partly caused by our choice to implement
certain other requirements, such as an interactive and intuitive user interface
around the core of the pattern extraction program. Below are all definitive
requirements.
\begin{itemize}
	\item[F1:] The system should be able to crawl RSS feeds.
This requirement is an adapted version of the compound
		requirements I1a-I1d. We limited the source types to strict
		RSS feeds because of the time constraints of the project. Most
		sources require an entirely different strategy and therefore
		we could not easily combine them. An explanation of why we
		chose RSS feeds can be found in Section~\ref{sec:whyrss}.
	\item[F2:] Export the data to a strict XML feed.
		This requirement is an adapted version of requirement I3; the
		change was also made to limit the scope. We chose not to
		interact directly with the database or the \textit{Temporum}.
		The application, however, is able to output XML data that is
		formatted following a strict XSD schema so that the data can
		easily be imported into the database or the \textit{Temporum}
		in an indirect way. A minimal sketch of such an export is
		given after this list.
	\item[F3:] The system should have a user interface to create crawlers
		that is usable by someone without a particular computer
		science background.
		This requirement is formed from I4. Initially the user
		interface for adding and training crawlers was a web interface
		that was user friendly and usable by someone without a
		particular computer science background, as the requirement
		stated. However, in the first prototypes this interface grew
		into a control center that can perform all previously
		described tasks with an interface that is usable without prior
		knowledge of computer science.
	\item[F4:] Report to the user or maintainer when a source has been
changed too much for successful crawling.
		This requirement was also present in the original requirements
		and has not changed.
\end{itemize}
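To illustrate requirement F2, the following is a minimal sketch, in Python, of
how crawled entries could be serialized to XML with the standard library. The
element names used here are assumptions for illustration only; the real output
is formatted following the strict XSD schema mentioned in requirement F2.
\begin{verbatim}
# Minimal illustration of exporting crawled entries to XML (requirement F2).
# Element names such as <events> and <event> are assumptions; the real output
# follows the strict XSD schema used by the system.
import xml.etree.ElementTree as ET

def entries_to_xml(entries):
    root = ET.Element("events")
    for entry in entries:
        event = ET.SubElement(root, "event")
        for field in ("title", "date", "time", "location"):
            ET.SubElement(event, field).text = entry.get(field, "")
    return ET.tostring(root, encoding="unicode")

print(entries_to_xml([{"title": "Concert", "date": "2015-01-01"}]))
\end{verbatim}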
\subsection{Non-functional requirements}
\subsubsection{Original non-functional requirements}
\begin{itemize}
	\item[O1:] Integrate in the existing system used by Hyperleap.
	\item[O2:] The system should work in a modular fashion so that the
		program can be extended in the future.
\end{itemize}
\subsubsection{Definitive non-functional requirements}
\begin{itemize}
	\item[N1:] Work in a modular fashion so that the program can be
		extended in the future.
The modularity is very important so that the components can be
easily extended and components can be added. Possible
extensions are discussed in Section~\ref{sec:discuss}.
	\item[N2:] Operate standalone on a server.
		Original non-functional requirement O1 is dropped because we
		want to keep the program as modular as possible; via an XML
		interface we still have a very close connection with the
		database without having to maintain a direct connection. The
		downside of an indirect connection compared to a direct
		connection is that the specification is much more rigid: if
		the system changes the specification, the backend program has
		to change as well.
\end{itemize}
\section{Application overview}
\label{appoverview}
\begin{figure}
	\centering
\includegraphics[width=\linewidth]{appoverview.pdf}
\caption{Overview of the application}
\end{figure}
\subsection{Frontend}
\subsubsection{General description}
The frontend is a web interface that is connected to the backend system and
allows the user to interact with the backend. The frontend consists of a basic
graphical user interface which is shown in Figure~\ref{frontendfront}. As the
interface shows, there are three main components that the user can use. There
is also a button for downloading the XML. The \textit{Get xml} button is a
quick shortcut to make the backend generate XML. The button for grabbing the
XML data is located there only for diagnostic purposes and is not used in the
standard workflow. In the standard workflow the server periodically calls the
XML output option from the command line interface of the backend to process
the data.
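The periodic export could, for example, be driven by a small script that the
server runs from a scheduler such as cron. The sketch below assumes a
hypothetical entry point \texttt{backend.py} with an \texttt{xml} argument;
the actual command line interface of the backend may differ.
\begin{verbatim}
# Hypothetical periodic export script; the backend invocation shown here is
# an assumption and may differ from the real command line interface.
import subprocess

def export_xml(output_path="crawled_events.xml"):
    # Ask the backend for its XML output and store it for further processing.
    result = subprocess.run(["python", "backend.py", "xml"],
                            capture_output=True, text=True, check=True)
    with open(output_path, "w") as f:
        f.write(result.stdout)

if __name__ == "__main__":
    export_xml()
\end{verbatim}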
\subsubsection{Add new crawler}
\label{addcrawler}
The addition or generation of crawlers is the key feature of the program and it
is the intelligent part of the system since it includes the graph optimization
algorithm to recognize user-specified patterns in the new data. First, the user
must fill in the static form that is visible at the top of the page. This form
contains general information about the venue together with some crawler
table is a stripped down version of the original RSS feed's \texttt{title} and
\texttt{summary} fields. When the text is marked it will be highlighted in the
same color as the color of the button text. The entirety of the user interface
with a few sample markings is shown in Figure~\ref{frontendfront}. After the
marking of the categories the user can preview the data or submit. Previewing
will run the crawler on the RSS feed in memory and the user can revise the
patterns if necessary. Submitting will send the page to the backend to be
the crawled data to XML and much more.
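As a simplified illustration of what applying a trained crawler to new data
could look like, the sketch below turns a marked example into a regular
expression with named groups and applies it to a new entry. This is only a
sketch of the idea; the actual system uses the graph optimization algorithm
mentioned above rather than plain regular expressions.
\begin{verbatim}
# Simplified sketch of recognizing user-specified patterns in new data.
# The real system uses a graph optimization algorithm; regular expressions
# are used here only to illustrate the idea.
import re

def pattern_from_marking(example, markings):
    # markings maps a category to the exact substring the user marked in the
    # example text, e.g. {"date": "01-01-2015", "time": "20:00"}.
    pattern = re.escape(example)
    for category, text in markings.items():
        pattern = pattern.replace(re.escape(text),
                                  "(?P<{}>.+?)".format(category))
    return re.compile(pattern)

example = "Concert in Nijmegen on 01-01-2015 at 20:00"
crawler = pattern_from_marking(example,
                               {"date": "01-01-2015", "time": "20:00"})
match = crawler.fullmatch("Concert in Nijmegen on 12-03-2015 at 21:30")
if match:
    print(match.group("date"), match.group("time"))  # 12-03-2015 21:30
\end{verbatim}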
\subsubsection{Standalone crawler}
The crawler is a program, also written in Python, that is used by the main
module and technically is part of the libraries. The property in which the
crawler stands out is that it can also run on its own. The crawler has to be
run periodically by a server to actually crawl the websites. The main module
communicates with the crawler when it is queried for XML data, when a new
crawler is added or when data is edited. The crawler also offers a command
line interface that has the same functionality as the web interface of the
control center.
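The way the crawler can both be imported by the main module and be run on its
own could look roughly like the sketch below; function names such as
\texttt{crawl\_feed} are assumptions made for illustration.
\begin{verbatim}
# Sketch of the dual role of the crawler: importable as a library by the main
# module and runnable standalone from the command line. The names used here
# are assumptions for illustration only.
import argparse

def crawl_feed(url):
    # Download the RSS feed and apply the stored patterns (omitted here).
    return []

def main():
    parser = argparse.ArgumentParser(description="Run the crawler standalone")
    parser.add_argument("url", help="URL of the RSS feed to crawl")
    args = parser.parse_args()
    for entry in crawl_feed(args.url):
        print(entry)

if __name__ == "__main__":
    # Skipped when imported by the main module; used when the server runs the
    # crawler periodically from the command line.
    main()
\end{verbatim}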
The crawler saves all the data in a database. The database is a simple
dictionary where all the entries are hashed so that the crawler knows which